From 1b0a15f569d10fd0e07e646190468acafc5d1d40 Mon Sep 17 00:00:00 2001
From: Nathan Ellingwood
Date: Mon, 8 Apr 2024 09:58:38 -0600
Subject: [PATCH] Master release 4.3.00 (#2163)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* HIP: since Kokkos has moved it out of experimental we should clean up
Just reflecting the move of HIP and HIPSpace out of experimental so that we do not get deprecation warnings and even failures down the road. This was really done in Kokkos Core 4.0.0 so it is time to catch up...
* Applying clang-format
* Sparse: fix cusparse spgemm hang properly
The issue is fixed by disabling the TPL in spec_avail when a problematic version of CUDA/cuSPARSE is being used.
* Sparse: fix logic for bad cusparse spgemm version.
Just inverted the logic statement to enable the TPL when it is known to work correctly.
* Improvements on the unification attempt logic for axpby(), including new tests
* Addressing feedback from Luc, plus some small changes here and there:

In KokkosBlas1_axpby_unification_attempt.hpp:
- Removed unnecessary variables, routines, and checks
- Imposed terminology consistency: variable names begin with lower case letters, type names begin with upper case letters
- Using static_assert as much as possible
- Using 'public' and 'private' keywords accordingly
- Improved some explanations and error messages

In KokkosBlas1_axpby_spec.hpp:
- Replace 'a' and 'b' by 'scalar_x' and 'scalar_y' where appropriate, to keep consistency with the terminology used in the 'impl' and 'mv_impl' files of the axpby operation.
- Not using the 'KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY' define anymore. Code is now consistent with the 'old' value 3 for such define.

In KokkosBlas1_axpby_impl.hpp and KokkosBlas1_axpby_mv_impl.hpp:
- Not using the 'KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY' define anymore. Code is now consistent with the 'old' value 3 for such define.
- Using 'if constexpr' whenever possible
- Checking that -1 <= scalar_x <= 2 and that -1 <= scalar_y <= 2
- Replaced '} else {' by '} else if (scalar_x == 2)' or by '} else if (scalar_y == 2)', whenever possible
- Improved error messages
- Improved explanation headers a bit

In KokkosBlas1_axpby.hpp:
- Renamed some variables to more meaningful names
* Formatting
* Using 'ifdef HAVE_KOKKOSKERNELS_DEBUG', per Luc's suggestion
* Addressing feedback from Luc
* Correcting compilation errors on my Mac
* Backup
* SYR2: fix unit-test type issue
On KokkosEco_Trilinos_Weaver_CUDA112_opt-uvm the SYR2 test generates a compile-time error, probably due to a mixed use of host and device views when comparing implemented vs. reference results.
* CUDA 11.0.1 / cuSPARSE 11.0.0 changed SpMM enums
* SYR2: applying clang-format
* CUDA 11.2.1 / cuSPARSE 11.4.0 changed SpMV
* KokkosBlas1_axpby: include for debug builds
Resolve compilation errors in debug mode: "error: no member named 'cout' in namespace 'std';"
* Backup
* Backup
* Backup
* Backup
* Backup
* Backup
* Backup
* Backup
* Backup
* Backup
* Backup
* Backup
* Address CI build errors
* Some cleanup on current pull request, making it more related to 'just' the creation of the lapack subdirectory and the moving of some files there
* More cleanup
* Re-enabling gesv unit tests under the lapack subdirectory
* Adding BLAS routines back, for backwards compatibility
* Formatting
* Small cleaning
* Correcting error in Jenkins
* Fixing compilation error on Jenkins when dealing with HIP
* Add required rtd conf file
* README.md: Use correct project slug
* docs/requirements.txt: Add sphinx-rtd-theme
* Addressing latest feedback from Luc.
* Formatting
* KokkosKernelsConfig.cmake: add all_libs target and necessary aliases
* Intent of these changes is to allow for building Trilinos with KokkosKernels as an external TPL
* hide native merge-path SpMV behind "native-merge"
* test native-merge algorithm
* Quick fix for night compilation with Trilinos
* SPTRSV: check if cusparse is available before calling TPL path
Since SpTRSV does not implement the TPL layer the usual way we need to be extra careful before calling the TPL implementation path. If cusparse is not available then we definitely want to revert back to calling the native implementation. Similarly, if the execution space is not Kokkos::CUDA, let's use the native implementation.
* SpTRSV: more strictly check prerequisites in SptrsvHandle
Check that CUSPARSE is enabled and that HandleExecSpace is Kokkos::CUDA before allowing users to set the implementation to use the CUSPARSE TPL.
* SpTRSV: fix some type definition and variable usage for cuSPARSE
Since we are guarding the cusparse path a bit better we need to be careful when some types are defined and to mark some variables (void) when they do not get used by an implementation...
* SpTRSV: applying clang-format
* SpTRSV: more fixes
* SpTRSV: apply clang-format
* SYCL: fix for Trilinos build with MKL
* Apply clang-format to non-cmake files
* SYR2: fix issue with bad type in test function
After comparing various function signatures and view types, the change allows tests to pass correctly and seems correct based on input params.
* Update Test_Blas2_syr2.hpp
Fix mistake in host/device view argument
* LAPACK: adding rocsolver TPL
Adding the necessary CMake logic and TPL layer to support rocsolver for LAPACK. Enabling the TPL in gesv and updating gesv test to run by default the more common configurations and only run specific ones when the associated TPL (MAGMA) is enabled.
* Lapack: change according to Brian's review
The SpaceAccessibility of IPIVV needs to be modified for MAGMA.
The value_type of IPIVV needs to be rocblas_int when running with rocSOLVER. The types used for gesv_tpl_spec_avail and the actual TPL instantiation where mismatched leading to linker error. * cmake/Dependencies.cmake: remove ROCSOLVER Removing ROCSOLVER to prevent configuration errors with Trilinos Will bring back when support is added in Trilinos for ROCSOLVER as TPL * Lapack: cusolver TPL logic and support for gesv Adding CMake logic to support cusolver and implementing gesv using cusolver getrf and getrs. Unit-test is passing without problems! * Lapack: updating logic in cm_generate_makefile for cusolver There is some specific TPL logic in cm_generate_makefile and it cannot be found for cusolver, changing that might to the trick! * Backup * Backup * Backup * Formatting * mv_unification tests with double are failing by very small amounts, e.g. 5.9e-14 vs. 3.6e-14 * Trying one more increment on tolerance * Putting pragma's and unrolls properly right before for loops (compilation warning at weaver) * Giving it another try to larger tolarance, after fixing the warning on pragma and unroll * Lapack: gesv, implementing review commments * Adding Changelog for Release 4.2.0 (#2031) * Adding Changelog for Release 4.2.0 Part of Kokkos C++ Performance Portability Programming EcoSystem 4.2 * Formatting the changelog a bit more Mentioning more clearly LAPACK vs BLAS, grouping PRs by logical work unit, etc... * Remove minor revisions, improve text descriptions * Changelog: add spmv perftest detail --------- Co-authored-by: Luc Berger Co-authored-by: Carl Pearson Co-authored-by: brian-kelley * NRM1: refactoring TPL layer a bit with c++17 if constexpr Hopefully this leads to simpler code, less duplication, less macro and easier maintenance! Adding support for oneapi MKL while making tpl layer changes. 
* BLAS: Nrm1 implementing Brian's feedback
* Blas: nrm1, fix in tpl spec decl
* BLAS: nrm1 problems with ExecSpace template and lack of Kokkos::Threads
Fix issue with Kokkos::Threads and Kokkos::HIP
* Another attempt while waiting to get access to the solo cluster
* Formatting
* Correcting error from the last commit
* Fixing the error that was happening only at the solo cluster
* Increase tolerance a bit more
* Increasing tolerances in all 4 locations
* Backup
* Backup
* Formatting
* Forgot to add ClusteringAlgorithm:: at some spots
* Formatting
* Lapack: fixing issue with Magma TPL in gesv, trtri, etc...
Adding proper support for MAGMA after having it moved to the Lapack directory and checking it does not create issues with cuSOLVER.
* Update blas/unit_test/Test_Blas1_swap.hpp
Co-authored-by: brian-kelley
* cmake: Add workaround check for CUSOLVER support with Trilinos
TPL_ENABLE_CUDA default enables CUBLAS and CUSOLVER in Trilinos, but not CUSPARSE
This PR modifies the TPL requirement checks to maintain compatibility with existing configuration options of Trilinos
Attempt to resolve/workaround issue #2047
* Addressing Brian Kelley's feedback
* Formatting
* Removing 'ClusteringAlgorithm::'
* Lapack: gesv, incorporate Brian's feedback
* Applying clang-format
* Fixing some deprecation warnings/errors for ROCm 6
* BLAS: fix bug in TPL layer of KokkosBlas::swap
The cuBLAS Kokkos::complex specialization had a small bug where the rank of the view was not specified correctly!
* CMake: fix bugs in deciding KOKKOSKERNELS_TPL_BLAS_RETURN_COMPLEX
* TPL: revise BLAS1 dot implementation
* Fix compile errors for C-linkage dot functions returning std::complex
* Use a C struct for complex numbers to avoid error: '_Complex' is a C99 extension [-Werror,-Wc99-extensions].
* Add a workaround by disabling host MKL dot with complex numbers * Allow KokkosKernels_ENABLE_PERFTESTS=ON to build perf_tests without KokkosKernels_ENABLE_TESTS=ON * format sparse/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp * cmake: fix tpl check so cusolver can be disabled when needed * Link std::filesystem for IntelLLVM in perf_test/sparse * gemm3 perf test: user CUDA, SYCL, or HIP device for kokkos:initialize * Fix for rocm_verison header inclusion * fence Kokkos before timed interations * Deprecate KOKKOSLINALG_OPT_LEVEL * Add CMake warning message if KokkosKernels_LINALG_OPT_LEVEL is used * Async matrix release for MKL >= 2023.2 * Support CUBLAS_{LIBRARIES,LIBRARY_DIRS,INCLUDE_DIRS,ROOT} and KokkosKernels_CUBLAS_ROOT * KokkosSparse_spmv_impl_merge.hpp: use capture by reference Resolve warnings in builds with c++20 support enabled: "kokkos-kernels/sparse/impl/KokkosSparse_spmv_impl_merge.hpp:166:81: warning: implicit capture of 'this' via '[=]' is deprecated in C++20 [-Wdeprecated]" * KokkosSparse_par_ilut_numeric_impl.hpp: use capture by reference Resolve warnings in builds with c++20 support enabled: "kokkos-kernels/sparse/impl/KokkosSparse_par_ilut_numeric_impl.hpp(591): warning #2908-D: the implicit by-copy capture of "this" is deprecated" * Backup * Backup * Backup * Backup * Formatting * Correcting compilation error * Typo * Changes for syr and syr2, to be tested at weaver * Formatting * Changes for axpby * Backup * Formatting * Just to force new checking tests in github * Addressing feedback from Luc. 
* Don't call optimize_gemv for one-shot spmv * Add HIPManagedSpace support - CMake option for ETI - Run unit tests with a Kokkos::Device, not just Kokkos::HIP - Like we do for Cuda - Still use HIPSpace unless Managed is the only enabled memspace - Couple of minor fixes - Allow querying free HIPManagedSpace memory for SpGEMM - Disable VBD coloring (not a huge deal, had to do same on CUDA) - Use correct memory space in SpTRSV solve * Backup * Backup * Backup * Minor typo * Add block support to all SPILUK algorithms (#2064) * Interface for block iluk * Progress. Test hooked up * Progress on test refactoring * More test reorg * Fix test * Refactor spiluk numeric a bit with a struct wrapper * Add good logging * progress * Fix block test * Progress but potential dead end * Giving up on this approach for now * progress * Make verbose * Progress * Progress * RP working? * Progress on TP alg * Bug fix * Progress on template stuff * Progress on block TP * Progress * Get rid of all the static_casts * More cleanup. Steams now support blocks * Tests not passing * Serail tests all working, both algs, blocked * Remove output coming from spiluk test * Final fixes for CPU * Cuda req full template specification for SerialGemm::invoke * Don't use scratch for now * Formatting * Fix warnings * Formatting * Add tolerance to view checks. 
Use macro and remove redundant test util * Fix for HIP * formatting * Another test reorg to fix weirdness on solo * formatting * Remove unused var * Github feedback * Remove test cout * formatting * Zero-size arrays can cause problems * Fix unused var warning * Add CUDA/HIP TPL support for KokkosSparse::spadd (#1962) * spadd: change arguments to ctor of SPADDHandle add a default value to input_sorted; add a second argument input_merged to indicate unqiue entries; So that we can easily know whether we can use TPLs on the input matrices * spadd: add cuda/rocm TPL support for spadd_symbolic/numeric * Make spiluk_handle::reset backwards compatible (#2087) * Make spiluk_handle::reset backwards compatible By making block_size default to -1, which means don't change block size. * Switch default val for block_size for reset_handle * formatting * Fix comment * spadd: add APIs without an execution space argument (#2090) * Lapack - SVD: adding initial files that do not implement anything (#2092) Adding SVD feature to Lapack component, the interface is similar to classic Lapack and the implementation relies on the TPL layer to provide initial capabilities. The TPL supported are LAPACK, MKL, cuSOLVER and rocSOLVER. Testing three analytical cases 2x2, 2x3 and 3x2 and then some randomly generated matrices. * Hands off namespace `Kokkos::Impl` - cleanup couple violations that snuck in (#2094) * Do not use things from namespace Kokkos::Impl (Kokkos::{Impl:: -> }ALL_t) * Do not use things from namespace Kokkos::Impl (Kokkos::Impl::DeepCopy) Can achieve the same with Kokkos::deep_copy * Fix warning `declaration of ‘std::size_t n’ shadows a parameter` * Change name of yaml-cpp to yamlcpp * Fix macro setting in CMakeLists * GMRES: Add support for BSR matrices Also, add a test for this. 
* Remove all mentions of HBWSpace * Reintroduce EXECSPACE_(SERIAL,OPENMP,THREADS}_VALID_MEM_SPACES Drop HBWSPACE as an option * Lapack: adding svd benchmark Fixing unit-test for CUSOLVER and adding benchmark to check the algorithm performance on various platforms. * Fix Cuda TPL finding (#2098) - Allow finding cusparse, cusolver based on manually provided paths - This is necessary when using an nvhpc toolchain instead of a standard cuda toolchain - Set header paths correctly (this is redundant in a cuda installation, in which $CUDA_ROOT/include is already a system include dir, but needed in other cases) * Add support for BSR matrices to some trsv routines (#2104) * Add support for BSR matrices to some trsv routines * Change trsv to gesv * Lapack - SVD: adding quick return when cuSOLVER is skipped (#2107) Currently we still run the tests on U, S and Vt which does not make sense since we actively skip this test because cuSOLVER does not support more columns than rows... * Fix build error in trsv on gcc8 * Add a workaround for compilation errors with cuda-12.2.0 + gcc-12.3 (#2108) On Perlmutter@NERSC, I met this error /usr/lib64/gcc/x86_64-suse-linux/12/include/avx512fp16intrin.h(38): error: vector_size attribute requires an arithmetic or enum type typedef __half __v8hf __attribute__ ((__vector_size__ (16))); The workaround was mentioned at https://forums.developer.nvidia.com/t/including-cub-header-breakes-compilation-with-gcc-12-and-sse2-or-better/255018 * Lapack - SVD: fix for unit-test when MKL is enabled (#2110) This is really a problem with our implementation of the BLAS interface when MKL is enabled since MKL redefines the function signatures of blas functions using MKL_INT instead if int... * Revert "Merge pull request #2037 from ndellingwood/remove-rocsolver-optional-dependency" (#2106) This reverts commit 5a36d577e725546062af3b297eec87e23a40ab58, reversing changes made to 2c66d291f9b5512e17f9375304902b6ba42133b2. 
* Fixing missing inclusion in source file * BLAS - MKL: fixing HostBlas calls to handle MKL_INT type (#2112) MKL redefines the BLAS interface based on how MKL_INT is defined we need to wrap that definition with our own Kokkos Kernels INT type to make both compatible with regular BLAS. applying clang-format * Fix weird Trilinos compiler error It seemed to have a problem with these deep_copies, so just do the copy by hand like it was being done before my recent trsv PR. * Update changelog * Update changelog * Block spiluk follow up (#2085) * Fix for gemm * Remove unused divide method * Enhancements to spiluk test * Progress. Block spiluk now checks out against analytical results * LUPrec test with spiluk woring * Disable spiluk LU test on non-host * Enhancements to spiluk test * Clean up a few issues uncovered by gh review * github workflows: update to v4 (use Node 20) * Refactor Test_Sparse_sptrsv (#2102) * Refactor Test_Sparse_sptrsv * More cleanups * Remove old commented-out code * CMake: error out in certain case (#2115) Graph unit tests are unique in that they use default_scalar for the KokkosKernelsHandle. So if test-eti-only is ON, but neither float nor double is instatiated, then error out for the graph unit tests. Users can still build without float or double if they want, but only if they turn off tests or the graph component. * Wiki examples for BLAS2 functions are added (#2122) Some small additional change the the function headers themselves to add some missing header file inclusions. Applying clang-format Removing constexpr since it won't happen before some work in Core. * Increase tolerance on gesv test (Fix #2123) (#2124) And uncomment the verbose output for when tolerance is exceeded, since that helps debug this sort of issue. This is only printed at most once so it won't spam the output if the entire vector is wrong. 
* Spmv handle (#2126) * spmv handle, TPL reuse * using handle in unification layer and hooking up new algorithm enums with old Controls options * Update spmv_merge perf test Compare KK merge vs. default and KK native * Small changes to help text of spmv_merge perf test * Complete backwards compatibility with Controls interface - copy over spmv algorithm selection correctly - copy expert tuning parameters * Controls spmv: accept other name for bsr algo * bsr spmv test: disable tensor core It was not actually being run before due to a different name actually enabling it (experimental_bsr_tc rather than experimental_tc) * Disable OneMKL spmv for complex types oneapi 2023.2 throws error saying complex isn't supported * OneMKL: call optimize_gemv during setup * Option to apply RCM reordering to extracted CRS diagonal blocks (#2125) * Add rcm option when extracting diagonal blocks * Update kk_extract_diagonal_blocks_crsmatrix_sequential * Add test for extracting diagonal blocks with rcm * Update RCM checking * cm_test_all_sandia: various updates - updates for blake * cm_test_all_sandia: drop decommissioned/unavailable machines - remove voltrino, mayer * Fix2130 (#2132) * Fix #2130 - Do not call BsrMatrix spmv impl if block size is 1 - Instead, convert it to unmanaged CrsMatrix and call spmv again - cuSPARSE returned an error code in this case - Better performance * Formatting * Remove redundant remove_pointer_t Handle is already a non-pointer type * Benchmark: modifying spmv benchmark to run range of spmv tests (#2135) This could be further automated to run on matrix from suite sparse * Kokkos Kernels: update version guards to drop old version of Kokkos (#2133) Since we are now in the 4.2 series we only support up to 4.1.00. Older version of Kokkos Core will require older version of Kokkos Kernels for compatibility. Once 4.3.00 is out we will move to drop support for the 4.1 series and only keep 4.2 and 4.3 series. 
* ODE: BDF methods (#1930) * ODE: adding BDF algorithms Implementing BDF formula for stiff ODEs. Orders 1 to 5 are available and tested. The integrators can be called on GPU to solve multiple systems in parallel. * ODE: fixing storage handling for start-up RK stack * ODE: clang-format * ODE: first adaptive version of BDF The current implementation only allows for adaptivity in time, at this point the BDF Step actually converges as expected with first order integration! * ODE: fixing issues with adaptive BDF The unit-test BDF_adaptive now shows the integration of the logistic equation using adaptive time steps and increasing integration order from 1 to 5. * ODE: running BDF on StiffChemistry problem The problem runs fine and is solved but there are oscillations while the behavior of the solution is smooth. More investigation is needed... * BDF: fixing types and template parameters in batched calls Bascially we need template parameters to be more versatile and cannot assume that all rank1 views will have the exact same underlying type, for instance layouts can be different. * More fixes for GPUs only in tests this time. * ODE: BDF adaptive, fix small bug After adding rhs and update vectors to temp the subviews taken for other variables need to be offset appropriately... * Revert "More fixes for GPUs only in tests this time." This reverts commit 2f70432761485bc6a4c65a1833e7299dd2c340e2. * Revert "Revert "More fixes for GPUs only in tests this time."" This reverts commit 836012bb529551727b3f5913057acad94dfe60df. * ODE: BDF small change to temporarily avoid compile time issue True fix involving a KOKKOS_VERSION check is upcoming after more tests on GPU side... * ODE: BDF fix for some printf statements that will go away soon... * ODE: adding benchmark for BDF The benchmark helps us monitor the performance of the BDF implementaiton across multiple platforms as well as impact of changes over time. * ODE: improve benchmark interface... 
* ODE: BDF changes to use RMS norm and change some default values Small changes to compare more closely with reference implementation. Some of these might be reverted eventually but that's fine for now. * ODE: BDF convergence more stable and results look pretty good now! Changing the Newton solver convergence criteria as well as changing a few default input parameters leads to a more stable algorithms which can now integrate the stiff Henderson autocatalytic example well in 66 time steps instead of 200k for fixed order integration... * ODE: BDF fix bug in initial time step calculation The initial step routine was overwriting the initial right hand side which led to obvious issues further down the road... now things should work fine. Need to figure out if I can re-initialize the variables in the perf test while excluding that time from each iteration. * ODE: BDF removing bad print statement... std::cout in device code * ODE - BDF: improving perf test Basically adding new untimed setup within the main loop of the benchmark to reset the intial conditions, buffers and vectors ahead of each iteration. * Modifying unit-test to catch proper return type * Applying clang-format * cm_test_all_sandia: update caraway compilers add rocm/5.6.1 and rocm/6.0.0, and openblas/0.3.23 as tpl * Sparse MKL: changing the location of the MKL_SAFE_CALL macro (#2134) * Sparse MKL: changing the location of the MKL_SAFE_CALL macro Moving the macro outside of namespaces to ensure that it will be interpreted correctly when called from any other location in the library. It does not make much sense to guard Impl code in the Experimental namespace and in this case it cleans up a problem with namespace disambiguation for the compiler... * Sparse BsrSpMV: removing Experimental namespace from Impl namespace * Applying clang-format * Sparse SpMV: fixing more namespace issues! 
* Fixing missing descriptor for bsr spmv
* Kokkos Kernels: change the default offset ETI from size_t to int (#2140)
This change makes it easier for customers to leverage TPL support, which almost always requires offset=int, ordinal=int to be enabled, meaning that no TPL support is available with our default ETI...
* KokkosSparse_spmv_bsrmatrix_spec: fix Bsr_TC_Precision namespacing
Resolve compilation errors in nightly cuda/12.2 A100 build
* Drop comment for cleaner clang-format fix
* Fix usage of RAII to set cusparse/rocsparse stream (#2141)
Temporary objects like "A()" get destructed immediately. For the object to have scope lifetime, it needs a name like "A a;". This was causing cusparse/rocsparse spmv to always execute on the default stream, causing incorrect timing in the spmv perf test.
* Use execution space operator== (#2136)
It actually is part of the public interface
* cm_test_all_sandia: more caraway module updates and cleanup (#2145)
* Spmv perftest improvements (#2146)
* Spmv perf test improvements
- Add option to flush caches by filling a dummy buffer between iterations
- Add option to call the non-reuse interface instead of handle/reuse interface
- Fix modes T, H in nonsquare case (make x,y the correct length)
* Fix mode help text
* Update version to 4.3.0
* Revert "Kokkos Kernels: change the default offset ETI from size_t to int (#2140)"
This reverts commit 3a5498d4353559b17e0712fe68241d6cf3de745a.
* Fix signed/unsigned comparison warnings (#2150)
This is only hit when spmv is called with integer scalars, which doesn't happen in our CI but does often in Tpetra.
* SPMV tpl fixes, cusparse workaround (#2152)
* SPMV tpl fixes, workaround
* Avoid possible integer conversion warnings
* Document cusparseSpMM algos that were tested
* Merge pull request #2147 from lucbv/KK_Utils_cleanup
KokkosKernels Utils: cleaning the zero_vector interface
(cherry picked from commit 363868e4d4c04f48a4eda67a66936bd4d20c30db)
* KokkosBlas1_axpby.hpp: change debug macro guard for printInformation (#2157)
* KokkosBlas1_axpby.hpp: change debug macro guard for printInformation
- resolves test failures in Trilinos (MueLu) that rely on gold file diff comparisons by removing extra output in debug builds
* fix compilation error
* Update changelog for 4.3.00 (#2148)
* Update changelog for 4.3.00
* Update CHANGELOG.md
---------
Co-authored-by: Luc Berger
* Fix changelog typo
* Fix merge artifacts
* CMakeLists.txt: fix Kokkos_VERSION check
* Merge pull request #2165 from ndellingwood/test-updates
Updates from feedback running Trilinos testing
(cherry picked from commit cacba80f76b6b726a28540d266aec66350078ab9)
* Update master_history.txt for 4.3.0
* KokkosLapack_svd_tpl_spec_decl: defer to MKL spec when LAPACK also enabled
Resolves redefinition of struct SVD compilation errors when both MKL and LAPACK are enabled
Reported by @maartenarnst in https://github.com/trilinos/Trilinos/issues/12891
Co-authored-by: brian-kelley
(cherry picked from commit 5bf5474dcc02d7c9cd25e9c9adb377c7c62a49fc)
---------
Co-authored-by: Luc Berger-Vergiat
Co-authored-by: Ernesto Prudencio
Co-authored-by: Carl Pearson
Co-authored-by: Evan Harvey
Co-authored-by: Carl Pearson
Co-authored-by: brian-kelley
Co-authored-by: Sean Miller
Co-authored-by: Junchao Zhang
Co-authored-by: Junchao Zhang
Co-authored-by: Brian Kelley
Co-authored-by: James Foucar
Co-authored-by: Damien L-G
Co-authored-by: Caleb Schilly
Co-authored-by: Damien L-G
Co-authored-by: Vinh Dang
---
 .github/workflows/docs.yml | 4 +-
 .github/workflows/format.yml | 2 +-
 .github/workflows/osx.yml | 4 +-
 .gitignore | 5 +-
.readthedocs.yaml | 35 + BUILD.md | 2 +- CHANGELOG.md | 94 + CMakeLists.txt | 23 +- CheckHostBlasReturnComplex.cmake | 8 +- README.md | 2 +- batched/KokkosBatched_Util.hpp | 24 - .../dense/impl/KokkosBatched_Gesv_Impl.hpp | 36 +- .../KokkosBatched_HostLevel_Gemm_Impl.hpp | 2 - .../KokkosBatched_SVD_Serial_Internal.hpp | 4 - .../impl/KokkosBatched_Trsm_Serial_Impl.hpp | 26 + .../impl/KokkosBatched_Trsm_Team_Impl.hpp | 30 + batched/dense/src/KokkosBatched_Gesv.hpp | 13 +- .../dense/src/KokkosBatched_Vector_SIMD.hpp | 8 + blas/CMakeLists.txt | 7 + .../KokkosBlas2_syr2_eti_spec_inst.cpp.in | 25 + .../KokkosBlas2_syr2_eti_spec_avail.hpp.in | 13 +- blas/impl/KokkosBlas1_axpby_impl.hpp | 559 ++-- blas/impl/KokkosBlas1_axpby_mv_impl.hpp | 1667 +++++----- blas/impl/KokkosBlas1_axpby_spec.hpp | 332 +- ...Blas1_axpby_unification_attempt_traits.hpp | 965 ++++++ blas/impl/KokkosBlas2_gemv_impl.hpp | 79 +- blas/impl/KokkosBlas2_gemv_spec.hpp | 6 +- blas/impl/KokkosBlas2_syr2_impl.hpp | 369 +++ blas/impl/KokkosBlas2_syr2_spec.hpp | 180 ++ blas/impl/KokkosBlas2_syr_impl.hpp | 2 +- blas/src/KokkosBlas1_axpby.hpp | 372 ++- blas/src/KokkosBlas1_dot.hpp | 44 +- blas/src/KokkosBlas1_swap.hpp | 16 +- blas/src/KokkosBlas2_ger.hpp | 11 +- blas/src/KokkosBlas2_syr.hpp | 7 +- blas/src/KokkosBlas2_syr2.hpp | 238 ++ blas/tpls/KokkosBlas1_dot_tpl_spec_avail.hpp | 42 +- blas/tpls/KokkosBlas1_dot_tpl_spec_decl.hpp | 436 ++- blas/tpls/KokkosBlas1_nrm1_tpl_spec_avail.hpp | 34 + blas/tpls/KokkosBlas1_nrm1_tpl_spec_decl.hpp | 835 +++-- .../KokkosBlas2_ger_tpl_spec_decl_blas.hpp | 18 +- .../KokkosBlas2_ger_tpl_spec_decl_cublas.hpp | 10 +- .../KokkosBlas2_ger_tpl_spec_decl_rocblas.hpp | 10 +- blas/tpls/KokkosBlas2_syr2_tpl_spec_avail.hpp | 205 ++ blas/tpls/KokkosBlas2_syr2_tpl_spec_decl.hpp | 35 + .../KokkosBlas2_syr2_tpl_spec_decl_blas.hpp | 317 ++ .../KokkosBlas2_syr2_tpl_spec_decl_cublas.hpp | 372 +++ ...KokkosBlas2_syr2_tpl_spec_decl_rocblas.hpp | 336 ++ 
blas/tpls/KokkosBlas3_gemm_tpl_spec_decl.hpp | 18 +- blas/tpls/KokkosBlas_Host_tpl.cpp | 939 +++--- blas/tpls/KokkosBlas_Host_tpl.hpp | 107 +- blas/unit_test/Test_Blas.hpp | 2 + blas/unit_test/Test_Blas1_axpby.hpp | 2 - .../Test_Blas1_axpby_unification.hpp | 2741 +++++++++++++++++ blas/unit_test/Test_Blas1_nrm1.hpp | 8 +- blas/unit_test/Test_Blas1_swap.hpp | 21 +- blas/unit_test/Test_Blas2_ger.hpp | 141 +- blas/unit_test/Test_Blas2_syr.hpp | 101 +- blas/unit_test/Test_Blas2_syr2.hpp | 1965 ++++++++++++ cm_generate_makefile.bash | 25 +- cmake/Dependencies.cmake | 4 +- cmake/KokkosKernels_config.h.in | 9 +- cmake/Modules/FindTPLCUBLAS.cmake | 61 +- cmake/Modules/FindTPLCUSOLVER.cmake | 46 + cmake/Modules/FindTPLCUSPARSE.cmake | 59 +- cmake/Modules/FindTPLROCSOLVER.cmake | 9 + cmake/kokkoskernels_components.cmake | 2 +- cmake/kokkoskernels_eti_devices.cmake | 28 +- cmake/kokkoskernels_features.cmake | 35 + cmake/kokkoskernels_tpls.cmake | 13 +- common/impl/KokkosKernels_ViewUtils.hpp | 6 - common/src/KokkosKernels_ExecSpaceUtils.hpp | 11 + .../src/KokkosKernels_PrintConfiguration.hpp | 20 + common/src/KokkosKernels_TplsVersion.hpp | 15 + common/src/KokkosKernels_Utils.hpp | 5 +- common/src/KokkosKernels_helpers.hpp | 10 +- common/src/KokkosLinAlg_config.h | 2 + docs/developer/apidocs/sparse.rst | 13 + docs/requirements.txt | 3 +- example/wiki/CMakeLists.txt | 1 + example/wiki/blas/CMakeLists.txt | 19 + example/wiki/blas/KokkosBlas2_wiki_ger.cpp | 23 + example/wiki/blas/KokkosBlas2_wiki_syr.cpp | 20 + example/wiki/blas/KokkosBlas2_wiki_syr2.cpp | 22 + .../sparse/KokkosSparse_wiki_bsrmatrix.cpp | 1 + graph/unit_test/CMakeLists.txt | 6 + graph/unit_test/Test_Graph_graph_color.hpp | 19 +- lapack/CMakeLists.txt | 16 +- .../svd/KokkosLapack_svd_eti_spec_inst.cpp.in | 26 + .../KokkosLapack_svd_eti_spec_avail.hpp.in | 24 + lapack/impl/KokkosLapack_gesv_spec.hpp | 46 +- lapack/impl/KokkosLapack_svd_impl.hpp | 34 + lapack/impl/KokkosLapack_svd_spec.hpp | 156 + 
lapack/src/KokkosLapack_gesv.hpp | 77 +- lapack/src/KokkosLapack_svd.hpp | 246 ++ lapack/tpls/KokkosLapack_Cuda_tpl.hpp | 23 + lapack/tpls/KokkosLapack_Host_tpl.cpp | 66 + lapack/tpls/KokkosLapack_Host_tpl.hpp | 6 + lapack/tpls/KokkosLapack_cusolver.hpp | 92 + .../tpls/KokkosLapack_gesv_tpl_spec_avail.hpp | 142 +- .../tpls/KokkosLapack_gesv_tpl_spec_decl.hpp | 916 +++--- lapack/tpls/KokkosLapack_magma.hpp | 8 +- .../tpls/KokkosLapack_svd_tpl_spec_avail.hpp | 171 + .../tpls/KokkosLapack_svd_tpl_spec_decl.hpp | 688 +++++ .../tpls/KokkosLapack_trtri_tpl_spec_decl.hpp | 1 + lapack/unit_test/Test_Lapack.hpp | 1 + lapack/unit_test/Test_Lapack_gesv.hpp | 194 +- lapack/unit_test/Test_Lapack_svd.hpp | 658 ++++ master_history.txt | 3 +- ode/impl/KokkosODE_BDF_impl.hpp | 532 ++++ ode/impl/KokkosODE_Newton_impl.hpp | 55 +- ode/src/KokkosODE_BDF.hpp | 227 ++ ode/src/KokkosODE_Newton.hpp | 10 +- ode/src/KokkosODE_Types.hpp | 13 +- ode/unit_test/Test_ODE.hpp | 1 + ode/unit_test/Test_ODE_BDF.hpp | 830 +++++ ode/unit_test/Test_ODE_Newton.hpp | 31 +- perf_test/CMakeLists.txt | 1 + ...s3_gemm_standalone_perf_test_benchmark.cpp | 14 +- perf_test/lapack/CMakeLists.txt | 8 + .../lapack/KokkosLapack_SVD_benchmark.cpp | 124 + perf_test/ode/CMakeLists.txt | 4 + perf_test/ode/KokkosODE_BDF.cpp | 266 ++ perf_test/performance/CMakeLists.txt | 4 +- perf_test/sparse/CMakeLists.txt | 5 + perf_test/sparse/KokkosSparse_kk_spmv.cpp | 266 +- perf_test/sparse/KokkosSparse_spadd.cpp | 10 +- .../sparse/KokkosSparse_spgemm_jacobi.cpp | 7 - perf_test/sparse/KokkosSparse_spiluk.cpp | 6 - .../sparse/KokkosSparse_spmv_benchmark.cpp | 89 +- perf_test/sparse/KokkosSparse_spmv_bsr.cpp | 40 +- .../KokkosSparse_spmv_bsr_benchmark.cpp | 16 +- perf_test/sparse/KokkosSparse_spmv_merge.cpp | 149 +- .../KokkosSparse_spmv_struct_tuning.cpp | 4 +- scripts/cm_test_all_sandia | 194 +- ...Sparse_spmv_bsrmatrix_eti_spec_inst.cpp.in | 4 +- ...rse_spmv_mv_bsrmatrix_eti_spec_inst.cpp.in | 2 - 
...parse_spmv_bsrmatrix_eti_spec_avail.hpp.in | 2 - ...se_spmv_mv_bsrmatrix_eti_spec_avail.hpp.in | 2 - sparse/impl/KokkosSparse_coo2crs_impl.hpp | 16 +- .../impl/KokkosSparse_gauss_seidel_impl.hpp | 8 +- sparse/impl/KokkosSparse_gmres_impl.hpp | 2 +- sparse/impl/KokkosSparse_gmres_spec.hpp | 19 +- .../KokkosSparse_par_ilut_numeric_impl.hpp | 24 +- .../impl/KokkosSparse_spadd_numeric_impl.hpp | 31 +- .../impl/KokkosSparse_spadd_numeric_spec.hpp | 62 +- .../impl/KokkosSparse_spadd_symbolic_impl.hpp | 114 +- .../impl/KokkosSparse_spadd_symbolic_spec.hpp | 52 +- .../impl/KokkosSparse_spiluk_numeric_impl.hpp | 1127 +++---- .../impl/KokkosSparse_spiluk_numeric_spec.hpp | 16 +- .../KokkosSparse_spiluk_symbolic_impl.hpp | 14 +- .../impl/KokkosSparse_spmv_bsrmatrix_impl.hpp | 205 +- .../KokkosSparse_spmv_bsrmatrix_impl_v42.hpp | 15 - .../impl/KokkosSparse_spmv_bsrmatrix_spec.hpp | 181 +- sparse/impl/KokkosSparse_spmv_impl.hpp | 91 +- sparse/impl/KokkosSparse_spmv_impl_merge.hpp | 2 +- sparse/impl/KokkosSparse_spmv_spec.hpp | 163 +- .../KokkosSparse_sptrsv_cuSPARSE_impl.hpp | 187 +- .../impl/KokkosSparse_sptrsv_solve_impl.hpp | 344 ++- .../impl/KokkosSparse_sptrsv_solve_spec.hpp | 38 +- .../KokkosSparse_sptrsv_symbolic_impl.hpp | 57 +- .../KokkosSparse_sptrsv_symbolic_spec.hpp | 22 +- sparse/impl/KokkosSparse_trsv_impl.hpp | 1211 ++++---- sparse/impl/KokkosSparse_trsv_spec.hpp | 71 +- sparse/src/KokkosKernels_Handle.hpp | 39 +- sparse/src/KokkosSparse_BsrMatrix.hpp | 4 + sparse/src/KokkosSparse_CrsMatrix.hpp | 4 + sparse/src/KokkosSparse_LUPrec.hpp | 89 +- sparse/src/KokkosSparse_Utils.hpp | 131 +- sparse/src/KokkosSparse_Utils_mkl.hpp | 65 +- sparse/src/KokkosSparse_coo2crs.hpp | 6 - .../src/KokkosSparse_gauss_seidel_handle.hpp | 11 +- sparse/src/KokkosSparse_gmres.hpp | 28 +- sparse/src/KokkosSparse_spadd.hpp | 254 +- sparse/src/KokkosSparse_spadd_handle.hpp | 53 +- sparse/src/KokkosSparse_spiluk.hpp | 1 - sparse/src/KokkosSparse_spiluk_handle.hpp | 115 +- 
sparse/src/KokkosSparse_spmv.hpp | 1631 +++------- sparse/src/KokkosSparse_spmv_deprecated.hpp | 299 ++ sparse/src/KokkosSparse_spmv_handle.hpp | 389 +++ sparse/src/KokkosSparse_sptrsv.hpp | 374 ++- sparse/src/KokkosSparse_sptrsv_handle.hpp | 16 + sparse/src/KokkosSparse_trsv.hpp | 26 +- ...kkosSparse_spadd_numeric_tpl_spec_decl.hpp | 282 ++ ...kosSparse_spadd_symbolic_tpl_spec_decl.hpp | 238 ++ .../KokkosSparse_spadd_tpl_spec_avail.hpp | 117 +- ...osSparse_spgemm_numeric_tpl_spec_avail.hpp | 2 + ...sSparse_spgemm_symbolic_tpl_spec_avail.hpp | 2 + ...osSparse_spmv_bsrmatrix_tpl_spec_avail.hpp | 84 +- ...kosSparse_spmv_bsrmatrix_tpl_spec_decl.hpp | 1154 +++---- .../KokkosSparse_spmv_mv_tpl_spec_avail.hpp | 14 +- .../KokkosSparse_spmv_mv_tpl_spec_decl.hpp | 117 +- .../tpls/KokkosSparse_spmv_tpl_spec_avail.hpp | 86 +- .../tpls/KokkosSparse_spmv_tpl_spec_decl.hpp | 893 +++--- sparse/unit_test/Test_Sparse.hpp | 2 - sparse/unit_test/Test_Sparse_bspgemm.hpp | 9 - .../Test_Sparse_extractCrsDiagonalBlocks.hpp | 45 +- sparse/unit_test/Test_Sparse_gauss_seidel.hpp | 15 +- sparse/unit_test/Test_Sparse_gmres.hpp | 217 +- sparse/unit_test/Test_Sparse_par_ilut.hpp | 108 +- sparse/unit_test/Test_Sparse_spadd.hpp | 23 +- sparse/unit_test/Test_Sparse_spgemm.hpp | 10 - sparse/unit_test/Test_Sparse_spiluk.hpp | 1145 ++++--- sparse/unit_test/Test_Sparse_spmv.hpp | 318 +- sparse/unit_test/Test_Sparse_spmv_bsr.hpp | 121 +- sparse/unit_test/Test_Sparse_sptrsv.hpp | 1916 +++++------- sparse/unit_test/Test_Sparse_trsv.hpp | 183 +- sparse/unit_test/Test_vector_fixtures.hpp | 212 ++ test_common/KokkosKernels_TestUtils.hpp | 11 +- test_common/Test_HIP.hpp | 13 +- 210 files changed, 25302 insertions(+), 10548 deletions(-) create mode 100644 .readthedocs.yaml create mode 100644 blas/eti/generated_specializations_cpp/syr2/KokkosBlas2_syr2_eti_spec_inst.cpp.in rename sparse/tpls/KokkosSparse_spadd_tpl_spec_decl.hpp => blas/eti/generated_specializations_hpp/KokkosBlas2_syr2_eti_spec_avail.hpp.in 
(76%) create mode 100644 blas/impl/KokkosBlas1_axpby_unification_attempt_traits.hpp create mode 100644 blas/impl/KokkosBlas2_syr2_impl.hpp create mode 100644 blas/impl/KokkosBlas2_syr2_spec.hpp create mode 100644 blas/src/KokkosBlas2_syr2.hpp create mode 100644 blas/tpls/KokkosBlas2_syr2_tpl_spec_avail.hpp create mode 100644 blas/tpls/KokkosBlas2_syr2_tpl_spec_decl.hpp create mode 100644 blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_blas.hpp create mode 100644 blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_cublas.hpp create mode 100644 blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_rocblas.hpp create mode 100644 blas/unit_test/Test_Blas1_axpby_unification.hpp create mode 100644 blas/unit_test/Test_Blas2_syr2.hpp create mode 100644 cmake/Modules/FindTPLCUSOLVER.cmake create mode 100644 cmake/Modules/FindTPLROCSOLVER.cmake create mode 100644 example/wiki/blas/CMakeLists.txt create mode 100644 example/wiki/blas/KokkosBlas2_wiki_ger.cpp create mode 100644 example/wiki/blas/KokkosBlas2_wiki_syr.cpp create mode 100644 example/wiki/blas/KokkosBlas2_wiki_syr2.cpp create mode 100644 lapack/eti/generated_specializations_cpp/svd/KokkosLapack_svd_eti_spec_inst.cpp.in create mode 100644 lapack/eti/generated_specializations_hpp/KokkosLapack_svd_eti_spec_avail.hpp.in create mode 100644 lapack/impl/KokkosLapack_svd_impl.hpp create mode 100644 lapack/impl/KokkosLapack_svd_spec.hpp create mode 100644 lapack/src/KokkosLapack_svd.hpp create mode 100644 lapack/tpls/KokkosLapack_cusolver.hpp create mode 100644 lapack/tpls/KokkosLapack_svd_tpl_spec_avail.hpp create mode 100644 lapack/tpls/KokkosLapack_svd_tpl_spec_decl.hpp create mode 100644 lapack/unit_test/Test_Lapack_svd.hpp create mode 100644 ode/impl/KokkosODE_BDF_impl.hpp create mode 100644 ode/src/KokkosODE_BDF.hpp create mode 100644 ode/unit_test/Test_ODE_BDF.hpp create mode 100644 perf_test/lapack/CMakeLists.txt create mode 100644 perf_test/lapack/KokkosLapack_SVD_benchmark.cpp create mode 100644 perf_test/ode/KokkosODE_BDF.cpp create mode 
100644 sparse/src/KokkosSparse_spmv_deprecated.hpp create mode 100644 sparse/src/KokkosSparse_spmv_handle.hpp create mode 100644 sparse/tpls/KokkosSparse_spadd_numeric_tpl_spec_decl.hpp create mode 100644 sparse/tpls/KokkosSparse_spadd_symbolic_tpl_spec_decl.hpp create mode 100644 sparse/unit_test/Test_vector_fixtures.hpp diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml index 558b6bd96d..04a1ba74b2 100644 --- a/.github/workflows/docs.yml +++ b/.github/workflows/docs.yml @@ -23,12 +23,12 @@ jobs: doxygen --version - name: checkout_kokkos_kernels - uses: actions/checkout@v3 + uses: actions/checkout@v4 with: path: kokkos-kernels - name: checkout_kokkos - uses: actions/checkout@v3 + uses: actions/checkout@v4 with: repository: kokkos/kokkos ref: develop diff --git a/.github/workflows/format.yml b/.github/workflows/format.yml index 220461fe62..6e2db4031a 100644 --- a/.github/workflows/format.yml +++ b/.github/workflows/format.yml @@ -13,7 +13,7 @@ jobs: clang-format-check: runs-on: ubuntu-20.04 steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v4 - name: Install Dependencies run: sudo apt install clang-format-8 diff --git a/.github/workflows/osx.yml b/.github/workflows/osx.yml index df6066d0d4..944807b032 100644 --- a/.github/workflows/osx.yml +++ b/.github/workflows/osx.yml @@ -50,12 +50,12 @@ jobs: steps: - name: checkout_kokkos_kernels - uses: actions/checkout@v3 + uses: actions/checkout@v4 with: path: kokkos-kernels - name: checkout_kokkos - uses: actions/checkout@v3 + uses: actions/checkout@v4 with: repository: kokkos/kokkos ref: ${{ github.base_ref }} diff --git a/.gitignore b/.gitignore index d64726e92e..6dcc5d6a5d 100644 --- a/.gitignore +++ b/.gitignore @@ -12,4 +12,7 @@ TAGS #Clangd indexing compile_commands.json .cache/ -.vscode/ \ No newline at end of file +.vscode/ + +#MacOS hidden files +.DS_Store diff --git a/.readthedocs.yaml b/.readthedocs.yaml new file mode 100644 index 0000000000..519282a179 --- /dev/null +++ 
b/.readthedocs.yaml @@ -0,0 +1,35 @@ +# Read the Docs configuration file for Sphinx projects +# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details + +# Required +version: 2 + +# Set the OS, Python version and other tools you might need +build: + os: ubuntu-22.04 + tools: + python: "3.12" + # You can also specify other tool versions: + # nodejs: "20" + # rust: "1.70" + # golang: "1.20" + +# Build documentation in the "docs/" directory with Sphinx +sphinx: + configuration: docs/conf.py + # You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs + # builder: "dirhtml" + # Fail on all warnings to avoid broken references + # fail_on_warning: true + +# Optionally build your docs in additional formats such as PDF and ePub +# formats: +# - pdf +# - epub + +# Optional but recommended, declare the Python requirements required +# to build your documentation +# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html +python: + install: + - requirements: docs/requirements.txt \ No newline at end of file diff --git a/BUILD.md b/BUILD.md index 5be269bd7c..6fcea4dd33 100644 --- a/BUILD.md +++ b/BUILD.md @@ -227,7 +227,7 @@ endif() * KokkosKernels_LAPACK_ROOT: PATH * Location of LAPACK install root. * Default: None or the value of the environment variable LAPACK_ROOT if set -* KokkosKernels_LINALG_OPT_LEVEL: BOOL +* KokkosKernels_LINALG_OPT_LEVEL: BOOL **DEPRECATED** * Optimization level for KokkosKernels computational kernels: a nonnegative integer. Higher levels result in better performance that is more uniform for corner cases, but increase build time and library size. The default value is 1, which should give performance within ten percent of optimal on most platforms, for most problems. 
* Default: 1 * KokkosKernels_MAGMA_ROOT: PATH diff --git a/CHANGELOG.md b/CHANGELOG.md index 3ebb102517..6bc9cb65a6 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,99 @@ # Change Log +## [4.3.00](https://github.com/kokkos/kokkos-kernels/tree/4.3.00) (2024-03-19) +[Full Changelog](https://github.com/kokkos/kokkos-kernels/compare/4.2.01...4.3.00) + +### New Features + +#### BLAS updates +- Syr2 [\#1942](https://github.com/kokkos/kokkos-kernels/pull/1942) + +#### LAPACK updates +- Adding cuSOLVER [\#2038](https://github.com/kokkos/kokkos-kernels/pull/2038) + - Fix for MAGMA with CUDA [\#2044](https://github.com/kokkos/kokkos-kernels/pull/2044) +- Adding rocSOLVER [\#2034](https://github.com/kokkos/kokkos-kernels/pull/2034) + - Fix rocSOLVER issue with Trilinos dependency [\#2037](https://github.com/kokkos/kokkos-kernels/pull/2037) +- Lapack - SVD [\#2092](https://github.com/kokkos/kokkos-kernels/pull/2092) + - Adding benchmark for SVD [\#2103](https://github.com/kokkos/kokkos-kernels/pull/2103) + - Quick return to fix cuSOLVER and improve performance [\#2107](https://github.com/kokkos/kokkos-kernels/pull/2107) + - Fix Intel MKL tolerance for SVD tests [\#2110](https://github.com/kokkos/kokkos-kernels/pull/2110) + +#### Sparse updates +- Add block support to all SPILUK algorithms [\#2064](https://github.com/kokkos/kokkos-kernels/pull/2064) + - Block spiluk follow up [\#2085](https://github.com/kokkos/kokkos-kernels/pull/2085) + - Make spiluk_handle::reset backwards compatible [\#2087](https://github.com/kokkos/kokkos-kernels/pull/2087) +- Sptrsv improvements + - Add sptrsv execution space overloads [\#1982](https://github.com/kokkos/kokkos-kernels/pull/1982) + - Refactor Test_Sparse_sptrsv [\#2102](https://github.com/kokkos/kokkos-kernels/pull/2102) + - Add support for BSR matrices to some trsv routines [\#2104](https://github.com/kokkos/kokkos-kernels/pull/2104) +- GMRES: Add support for BSR matrices [\#2097](https://github.com/kokkos/kokkos-kernels/pull/2097) 
+- Spmv handle [\#2126](https://github.com/kokkos/kokkos-kernels/pull/2126) +- Option to apply RCM reordering to extracted CRS diagonal blocks [\#2125](https://github.com/kokkos/kokkos-kernels/pull/2125) + +#### ODE updates +- Adding adaptive BDF methods [\#1930](https://github.com/kokkos/kokkos-kernels/pull/1930) + +#### Misc updates +- Add HIPManagedSpace support [\#2079](https://github.com/kokkos/kokkos-kernels/pull/2079) + +### Enhancements: + +#### BLAS +- Axpby: improvement on unification attempt logic and on the execution of a diversity of situations [\#1895](https://github.com/kokkos/kokkos-kernels/pull/1895) + +#### Misc updates +- Use execution space operator== [\#2136](https://github.com/kokkos/kokkos-kernels/pull/2136) + +#### TPL support +- Add TPL support for KokkosBlas::dot [\#1949](https://github.com/kokkos/kokkos-kernels/pull/1949) +- Add CUDA/HIP TPL support for KokkosSparse::spadd [\#1962](https://github.com/kokkos/kokkos-kernels/pull/1962) +- Don't call optimize_gemv for one-shot MKL spmv [\#2073](https://github.com/kokkos/kokkos-kernels/pull/2073) +- Async matrix release for MKL >= 2023.2 in SpMV [\#2074](https://github.com/kokkos/kokkos-kernels/pull/2074) +- BLAS - MKL: fixing HostBlas calls to handle MKL_INT type [\#2112](https://github.com/kokkos/kokkos-kernels/pull/2112) + +### Build System: +- Support CUBLAS_{LIBRARIES,LIBRARY_DIRS,INCLUDE_DIRS,ROOT} and KokkosKernels_CUBLAS_ROOT CMake options [\#2075](https://github.com/kokkos/kokkos-kernels/pull/2075) +- Link std::filesystem for IntelLLVM in perf_test/sparse [\#2055](https://github.com/kokkos/kokkos-kernels/pull/2055) +- Fix Cuda TPL finding [\#2098](https://github.com/kokkos/kokkos-kernels/pull/2098) +- CMake: error out in certain case [\#2115](https://github.com/kokkos/kokkos-kernels/pull/2115) + +### Documentation and Testing: +- par_ilut: Update documentation for fill_in_limit [\#2001](https://github.com/kokkos/kokkos-kernels/pull/2001) +- Wiki examples for BLAS2 functions are added 
[\#2122](https://github.com/kokkos/kokkos-kernels/pull/2122) +- github workflows: update to v4 (use Node 20) [\#2119](https://github.com/kokkos/kokkos-kernels/pull/2119) + +### Benchmarks: +- gemm3 perf test: user CUDA, SYCL, or HIP device for kokkos:initialize [\#2058](https://github.com/kokkos/kokkos-kernels/pull/2058) +- Lapack: adding svd benchmark [\#2103](https://github.com/kokkos/kokkos-kernels/pull/2103) +- Benchmark: modifying spmv benchmark to fix interface and run range of spmv tests [\#2135](https://github.com/kokkos/kokkos-kernels/pull/2135) + +### Cleanup: +- Experimental hip cleanup [\#1999](https://github.com/kokkos/kokkos-kernels/pull/1999) +- iostream clean-up in benchmarks [\#2004](https://github.com/kokkos/kokkos-kernels/pull/2004) +- Update: implicit capture of 'this' via '[=]' is deprecated in C++20 warnings [\#2076](https://github.com/kokkos/kokkos-kernels/pull/2076) +- Deprecate KOKKOSLINALG_OPT_LEVEL [\#2072](https://github.com/kokkos/kokkos-kernels/pull/2072) +- Remove all mentions of HBWSpace [\#2101](https://github.com/kokkos/kokkos-kernels/pull/2101) +- Change name of yaml-cpp to yamlcpp (trilinos/Trilinos#12710) [\#2099](https://github.com/kokkos/kokkos-kernels/pull/2099) +- Hands off namespace Kokkos::Impl - cleanup couple violations that snuck in [\#2094](https://github.com/kokkos/kokkos-kernels/pull/2094) +- Kokkos Kernels: update version guards to drop old version of Kokkos [\#2133](https://github.com/kokkos/kokkos-kernels/pull/2133) +- Sparse MKL: changing the location of the MKL_SAFE_CALL macro [\#2134](https://github.com/kokkos/kokkos-kernels/pull/2134) + +### Bug Fixes: +- Bspgemm cusparse hang [\#2008](https://github.com/kokkos/kokkos-kernels/pull/2008) +- bhalf_t fix for isnan function [\#2007](https://github.com/kokkos/kokkos-kernels/pull/2007) +- Fence Kokkos before timed iterations [\#2066](https://github.com/kokkos/kokkos-kernels/pull/2066) +- CUDA 11.2.1 / cuSPARSE 11.4.0 changed SpMV enums 
[\#2011](https://github.com/kokkos/kokkos-kernels/pull/2011) +- Fix the spadd API [\#2090](https://github.com/kokkos/kokkos-kernels/pull/2090) +- Axpby reduce deep copy calls [\#2081](https://github.com/kokkos/kokkos-kernels/pull/2081) +- Correcting BLAS test failures with cuda when ETI_ONLY = OFF (issue #2061) [\#2077](https://github.com/kokkos/kokkos-kernels/pull/2077) +- Fix weird Trilinos compiler error [\#2117](https://github.com/kokkos/kokkos-kernels/pull/2117) +- Fix for missing STL inclusion [\#2113](https://github.com/kokkos/kokkos-kernels/pull/2113) +- Fix build error in trsv on gcc8 [\#2111](https://github.com/kokkos/kokkos-kernels/pull/2111) +- Add a workaround for compilation errors with cuda-12.2.0 + gcc-12.3 [\#2108](https://github.com/kokkos/kokkos-kernels/pull/2108) +- Increase tolerance on gesv test (Fix #2123) [\#2124](https://github.com/kokkos/kokkos-kernels/pull/2124) +- Fix usage of RAII to set cusparse/rocsparse stream [\#2141](https://github.com/kokkos/kokkos-kernels/pull/2141) +- Spmv bsr matrix fix missing matrix descriptor (rocsparse) [\#2138](https://github.com/kokkos/kokkos-kernels/pull/2138) + ## [4.2.01](https://github.com/kokkos/kokkos-kernels/tree/4.2.01) (2024-01-17) [Full Changelog](https://github.com/kokkos/kokkos-kernels/compare/4.2.00...4.2.01) diff --git a/CMakeLists.txt b/CMakeLists.txt index 4847b51e9b..bd3d761bdb 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -10,8 +10,8 @@ SET(KOKKOSKERNELS_TOP_BUILD_DIR ${CMAKE_CURRENT_BINARY_DIR}) SET(KOKKOSKERNELS_TOP_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}) SET(KokkosKernels_VERSION_MAJOR 4) -SET(KokkosKernels_VERSION_MINOR 2) -SET(KokkosKernels_VERSION_PATCH 1) +SET(KokkosKernels_VERSION_MINOR 3) +SET(KokkosKernels_VERSION_PATCH 0) SET(KokkosKernels_VERSION "${KokkosKernels_VERSION_MAJOR}.${KokkosKernels_VERSION_MINOR}.${KokkosKernels_VERSION_PATCH}") #Set variables for config file @@ -127,13 +127,13 @@ ELSE() IF (NOT KOKKOSKERNELS_HAS_TRILINOS AND NOT KOKKOSKERNELS_HAS_PARENT) # 
This is a standalone build FIND_PACKAGE(Kokkos REQUIRED) - IF((${Kokkos_VERSION} VERSION_EQUAL "4.1.00") OR (${Kokkos_VERSION} VERSION_GREATER_EQUAL "4.2.00")) + IF((${Kokkos_VERSION} VERSION_GREATER_EQUAL "4.1.0") AND (${Kokkos_VERSION} VERSION_LESS_EQUAL "4.3.0")) MESSAGE(STATUS "Found Kokkos version ${Kokkos_VERSION} at ${Kokkos_DIR}") - IF((${Kokkos_VERSION} VERSION_GREATER "4.2.99")) + IF((${Kokkos_VERSION} VERSION_GREATER "4.3.99")) MESSAGE(WARNING "Configuring with Kokkos ${Kokkos_VERSION} which is newer than the expected develop branch - version check may need update") ENDIF() ELSE() - MESSAGE(FATAL_ERROR "Kokkos Kernels ${KokkosKernels_VERSION} requires 4.1.00, 4.2.00, 4.2.01 or develop") + MESSAGE(FATAL_ERROR "Kokkos Kernels ${KokkosKernels_VERSION} requires Kokkos_VERSION 4.1.0, 4.2.0, 4.2.1 or 4.3.0") ENDIF() ENDIF() @@ -156,9 +156,16 @@ ELSE() KOKKOSKERNELS_ADD_OPTION_AND_DEFINE( LINALG_OPT_LEVEL KOKKOSLINALG_OPT_LEVEL - "Optimization level for KokkosKernels computational kernels: a nonnegative integer. Higher levels result in better performance that is more uniform for corner cases, but increase build time and library size. The default value is 1, which should give performance within ten percent of optimal on most platforms, for most problems. Default: 1" + "DEPRECATED. Optimization level for KokkosKernels computational kernels: a nonnegative integer. Higher levels result in better performance that is more uniform for corner cases, but increase build time and library size. The default value is 1, which should give performance within ten percent of optimal on most platforms, for most problems. 
Default: 1" "1") + if (KokkosKernels_LINALG_OPT_LEVEL AND NOT KokkosKernels_LINALG_OPT_LEVEL STREQUAL "1") + message(WARNING "KokkosKernels_LINALG_OPT_LEVEL is deprecated!") + endif() + if(KokkosKernels_KOKKOSLINALG_OPT_LEVEL AND NOT KokkosKernels_KOKKOSLINALG_OPT_LEVEL STREQUAL "1") + message(WARNING "KokkosKernels_KOKKOSLINALG_OPT_LEVEL is deprecated!") + endif() + # Enable experimental features of KokkosKernels if set at configure # time. Default is no. KOKKOSKERNELS_ADD_OPTION_AND_DEFINE( @@ -375,8 +382,10 @@ ELSE() KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC MKL) KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC CUBLAS) KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC CUSPARSE) + KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC CUSOLVER) KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC ROCBLAS) KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC ROCSPARSE) + KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC ROCSOLVER) KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC METIS) KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC ARMPL) KOKKOSKERNELS_LINK_TPL(kokkoskernels PUBLIC MAGMA) @@ -425,7 +434,7 @@ ELSE() IF (KOKKOSKERNELS_ALL_COMPONENTS_ENABLED) IF (KokkosKernels_ENABLE_PERFTESTS) MESSAGE(STATUS "Enabling perf tests.") - KOKKOSKERNELS_ADD_TEST_DIRECTORIES(perf_test) + add_subdirectory(perf_test) # doesn't require KokkosKernels_ENABLE_TESTS=ON ENDIF () IF (KokkosKernels_ENABLE_EXAMPLES) MESSAGE(STATUS "Enabling examples.") diff --git a/CheckHostBlasReturnComplex.cmake b/CheckHostBlasReturnComplex.cmake index b9528ce45a..657a9f2286 100644 --- a/CheckHostBlasReturnComplex.cmake +++ b/CheckHostBlasReturnComplex.cmake @@ -21,8 +21,8 @@ FUNCTION(CHECK_HOST_BLAS_RETURN_COMPLEX VARNAME) extern \"C\" { void F77_BLAS_MANGLE(zdotc,ZDOTC)( - std::complex* result, const int* n, - const std::complex x[], const int* incx, + std::complex* result, const int* n, + const std::complex x[], const int* incx, const std::complex y[], const int* incy); } @@ -49,9 +49,9 @@ int main() { CHECK_CXX_SOURCE_RUNS("${SOURCE}" 
KK_BLAS_RESULT_AS_POINTER_ARG) IF(${KK_BLAS_RESULT_AS_POINTER_ARG}) - SET(VARNAME OFF) + SET(${VARNAME} OFF PARENT_SCOPE) ELSE() - SET(VARNAME ON) + SET(${VARNAME} ON PARENT_SCOPE) ENDIF() ENDFUNCTION() diff --git a/README.md b/README.md index 0da1057870..bdad1442ce 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -[![Generic badge](https://readthedocs.org/projects/pip/badge/?version=latest&style=flat)](https://kokkos-kernels.readthedocs.io/en/latest/) +[![Generic badge](https://readthedocs.org/projects/kokkos-kernels/badge/?version=latest)](https://kokkos-kernels.readthedocs.io/en/latest/) ![KokkosKernels](https://avatars2.githubusercontent.com/u/10199860?s=200&v=4) diff --git a/batched/KokkosBatched_Util.hpp b/batched/KokkosBatched_Util.hpp index 9078281e59..fc14bd5a19 100644 --- a/batched/KokkosBatched_Util.hpp +++ b/batched/KokkosBatched_Util.hpp @@ -626,18 +626,6 @@ KOKKOS_INLINE_FUNCTION auto subview_wrapper(ViewType v, IdxType1 i1, const Trans::NoTranspose) { return subview_wrapper(v, i1, i2, i3, layout_tag); } -#if KOKKOS_VERSION < 40099 -template -KOKKOS_INLINE_FUNCTION auto subview_wrapper(ViewType v, IdxType1 i1, - Kokkos::Impl::ALL_t i2, - Kokkos::Impl::ALL_t i3, - const BatchLayout::Left &layout_tag, - const Trans::Transpose) { - auto sv_nt = subview_wrapper(v, i1, i3, i2, layout_tag); - - return transpose_2d_view(sv_nt, layout_tag); -} -#else template KOKKOS_INLINE_FUNCTION auto subview_wrapper(ViewType v, IdxType1 i1, Kokkos::ALL_t i2, Kokkos::ALL_t i3, @@ -647,7 +635,6 @@ KOKKOS_INLINE_FUNCTION auto subview_wrapper(ViewType v, IdxType1 i1, return transpose_2d_view(sv_nt, layout_tag); } -#endif template KOKKOS_INLINE_FUNCTION auto subview_wrapper(ViewType v, IdxType1 i1, IdxType2 i2, IdxType3 i3, @@ -671,16 +658,6 @@ KOKKOS_INLINE_FUNCTION auto subview_wrapper( const BatchLayout::Right &layout_tag, const Trans::NoTranspose &) { return subview_wrapper(v, i1, i2, i3, layout_tag); } -#if KOKKOS_VERSION < 40099 -template -KOKKOS_INLINE_FUNCTION 
auto subview_wrapper( - ViewType v, IdxType1 i1, Kokkos::Impl::ALL_t i2, Kokkos::Impl::ALL_t i3, - const BatchLayout::Right &layout_tag, const Trans::Transpose &) { - auto sv_nt = subview_wrapper(v, i1, i3, i2, layout_tag); - - return transpose_2d_view(sv_nt, layout_tag); -} -#else template KOKKOS_INLINE_FUNCTION auto subview_wrapper( ViewType v, IdxType1 i1, Kokkos::ALL_t i2, Kokkos::ALL_t i3, @@ -689,7 +666,6 @@ KOKKOS_INLINE_FUNCTION auto subview_wrapper( return transpose_2d_view(sv_nt, layout_tag); } -#endif template KOKKOS_INLINE_FUNCTION auto subview_wrapper( ViewType v, IdxType1 i1, IdxType2 i2, IdxType3 i3, diff --git a/batched/dense/impl/KokkosBatched_Gesv_Impl.hpp b/batched/dense/impl/KokkosBatched_Gesv_Impl.hpp index e4e0d5b8b7..86d0d0873e 100644 --- a/batched/dense/impl/KokkosBatched_Gesv_Impl.hpp +++ b/batched/dense/impl/KokkosBatched_Gesv_Impl.hpp @@ -366,20 +366,24 @@ KOKKOS_INLINE_FUNCTION void TeamVectorHadamard1D(const MemberType &member, /// =========== template <> struct SerialGesv { - template + template KOKKOS_INLINE_FUNCTION static int invoke(const MatrixType A, - const VectorType X, - const VectorType Y, + const XVectorType X, + const YVectorType Y, const MatrixType tmp) { #if (KOKKOSKERNELS_DEBUG_LEVEL > 0) static_assert(Kokkos::is_view::value, "KokkosBatched::gesv: MatrixType is not a Kokkos::View."); - static_assert(Kokkos::is_view::value, - "KokkosBatched::gesv: VectorType is not a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "KokkosBatched::gesv: XVectorType is not a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "KokkosBatched::gesv: YVectorType is not a Kokkos::View."); static_assert(MatrixType::rank == 2, "KokkosBatched::gesv: MatrixType must have rank 2."); - static_assert(VectorType::rank == 1, - "KokkosBatched::gesv: VectorType must have rank 1."); + static_assert(XVectorType::rank == 1, + "KokkosBatched::gesv: XVectorType must have rank 1."); + static_assert(YVectorType::rank == 1, + 
"KokkosBatched::gesv: YVectorType must have rank 1."); // Check compatibility of dimensions at run time. @@ -462,20 +466,24 @@ struct SerialGesv { template <> struct SerialGesv { - template + template KOKKOS_INLINE_FUNCTION static int invoke(const MatrixType A, - const VectorType X, - const VectorType Y, + const XVectorType X, + const YVectorType Y, const MatrixType /*tmp*/) { #if (KOKKOSKERNELS_DEBUG_LEVEL > 0) static_assert(Kokkos::is_view::value, "KokkosBatched::gesv: MatrixType is not a Kokkos::View."); - static_assert(Kokkos::is_view::value, - "KokkosBatched::gesv: VectorType is not a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "KokkosBatched::gesv: XVectorType is not a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "KokkosBatched::gesv: YVectorType is not a Kokkos::View."); static_assert(MatrixType::rank == 2, "KokkosBatched::gesv: MatrixType must have rank 2."); - static_assert(VectorType::rank == 1, - "KokkosBatched::gesv: VectorType must have rank 1."); + static_assert(XVectorType::rank == 1, + "KokkosBatched::gesv: XVectorType must have rank 1."); + static_assert(YVectorType::rank == 1, + "KokkosBatched::gesv: YVectorType must have rank 1."); // Check compatibility of dimensions at run time. 
diff --git a/batched/dense/impl/KokkosBatched_HostLevel_Gemm_Impl.hpp b/batched/dense/impl/KokkosBatched_HostLevel_Gemm_Impl.hpp index f70fa6b963..464ea6d04a 100644 --- a/batched/dense/impl/KokkosBatched_HostLevel_Gemm_Impl.hpp +++ b/batched/dense/impl/KokkosBatched_HostLevel_Gemm_Impl.hpp @@ -93,11 +93,9 @@ int BatchedGemmImpl(BatchedGemmHandleType *const handle, const ScalarType alpha, case BaseKokkosBatchedAlgos::KK_SERIAL: case BaseHeuristicAlgos::SQUARE: case BaseTplAlgos::ARMPL: -#if KOKKOS_VERSION > 40099 assert(A.rank_dynamic() == 3 && "AViewType must have rank 3."); assert(B.rank_dynamic() == 3 && "BViewType must have rank 3."); assert(C.rank_dynamic() == 3 && "CViewType must have rank 3."); -#endif break; default: std::ostringstream os; diff --git a/batched/dense/impl/KokkosBatched_SVD_Serial_Internal.hpp b/batched/dense/impl/KokkosBatched_SVD_Serial_Internal.hpp index c9fd0417f6..34c92c2d24 100644 --- a/batched/dense/impl/KokkosBatched_SVD_Serial_Internal.hpp +++ b/batched/dense/impl/KokkosBatched_SVD_Serial_Internal.hpp @@ -55,11 +55,7 @@ struct SerialSVDInternal { value_type a = Kokkos::ArithTraits::one(); value_type b = -a11 - a22; value_type c = a11 * a22 - a21 * a21; -#if KOKKOS_VERSION >= 30699 using Kokkos::sqrt; -#else - using Kokkos::Experimental::sqrt; -#endif value_type sqrtDet = sqrt(b * b - 4 * a * c); e1 = (-b + sqrtDet) / (2 * a); e2 = (-b - sqrtDet) / (2 * a); diff --git a/batched/dense/impl/KokkosBatched_Trsm_Serial_Impl.hpp b/batched/dense/impl/KokkosBatched_Trsm_Serial_Impl.hpp index 268df195ce..4d094c24d2 100644 --- a/batched/dense/impl/KokkosBatched_Trsm_Serial_Impl.hpp +++ b/batched/dense/impl/KokkosBatched_Trsm_Serial_Impl.hpp @@ -176,6 +176,32 @@ struct SerialTrsm +struct SerialTrsm { + template + KOKKOS_INLINE_FUNCTION static int invoke(const ScalarType alpha, + const AViewType &A, + const BViewType &B) { + return SerialTrsmInternalLeftLower::invoke( + ArgDiag::use_unit_diag, B.extent(1), B.extent(0), alpha, A.data(), + 
A.stride_0(), A.stride_1(), B.data(), B.stride_1(), B.stride_0()); + } +}; + +template +struct SerialTrsm { + template + KOKKOS_INLINE_FUNCTION static int invoke(const ScalarType alpha, + const AViewType &A, + const BViewType &B) { + return SerialTrsmInternalLeftLower::invoke( + ArgDiag::use_unit_diag, B.extent(1), B.extent(0), alpha, A.data(), + A.stride_0(), A.stride_1(), B.data(), B.stride_1(), B.stride_0()); + } +}; + /// /// L/U/NT /// diff --git a/batched/dense/impl/KokkosBatched_Trsm_Team_Impl.hpp b/batched/dense/impl/KokkosBatched_Trsm_Team_Impl.hpp index 41fe47a35e..a7430775ea 100644 --- a/batched/dense/impl/KokkosBatched_Trsm_Team_Impl.hpp +++ b/batched/dense/impl/KokkosBatched_Trsm_Team_Impl.hpp @@ -99,6 +99,36 @@ struct TeamTrsm +struct TeamTrsm { + template + KOKKOS_INLINE_FUNCTION static int invoke(const MemberType &member, + const ScalarType alpha, + const AViewType &A, + const BViewType &B) { + return TeamTrsmInternalLeftLower::invoke( + member, ArgDiag::use_unit_diag, B.extent(1), B.extent(0), alpha, + A.data(), A.stride_0(), A.stride_1(), B.data(), B.stride_1(), + B.stride_0()); + } +}; + +template +struct TeamTrsm { + template + KOKKOS_INLINE_FUNCTION static int invoke(const MemberType &member, + const ScalarType alpha, + const AViewType &A, + const BViewType &B) { + return TeamTrsmInternalLeftLower::invoke( + member, ArgDiag::use_unit_diag, B.extent(1), B.extent(0), alpha, + A.data(), A.stride_0(), A.stride_1(), B.data(), B.stride_1(), + B.stride_0()); + } +}; + /// /// L/U/NT /// diff --git a/batched/dense/src/KokkosBatched_Gesv.hpp b/batched/dense/src/KokkosBatched_Gesv.hpp index 3abedfd0aa..c4821db459 100644 --- a/batched/dense/src/KokkosBatched_Gesv.hpp +++ b/batched/dense/src/KokkosBatched_Gesv.hpp @@ -63,11 +63,18 @@ struct Gesv { template struct SerialGesv { - template + template KOKKOS_INLINE_FUNCTION static int invoke(const MatrixType A, - const VectorType X, - const VectorType Y, + const XVectorType X, + const YVectorType Y, const 
MatrixType tmp); + + template + [[deprecated]] KOKKOS_INLINE_FUNCTION static int invoke( + const MatrixType A, const VectorType X, const VectorType Y, + const MatrixType tmp) { + return invoke(A, X, Y, tmp); + } }; /// \brief Team Batched GESV: diff --git a/batched/dense/src/KokkosBatched_Vector_SIMD.hpp b/batched/dense/src/KokkosBatched_Vector_SIMD.hpp index e27419e7c2..753904dbb9 100644 --- a/batched/dense/src/KokkosBatched_Vector_SIMD.hpp +++ b/batched/dense/src/KokkosBatched_Vector_SIMD.hpp @@ -513,6 +513,11 @@ class Vector, 4> { #if defined(__KOKKOSBATCHED_ENABLE_AVX__) #if defined(__AVX__) || defined(__AVX2__) + +#if CUDA_VERSION < 12022 +#undef _Float16 +#endif + #include namespace KokkosBatched { @@ -668,6 +673,9 @@ class Vector >, 2> { #endif /* #if defined(__AVX__) || defined(__AVX2__) */ #if defined(__AVX512F__) +#if CUDA_VERSION < 12022 +#undef _Float16 +#endif #include namespace KokkosBatched { diff --git a/blas/CMakeLists.txt b/blas/CMakeLists.txt index 869b152e7b..5bc7217cfd 100644 --- a/blas/CMakeLists.txt +++ b/blas/CMakeLists.txt @@ -297,6 +297,13 @@ KOKKOSKERNELS_GENERATE_ETI(Blas2_syr syr TYPE_LISTS FLOATS LAYOUTS DEVICES ) +KOKKOSKERNELS_GENERATE_ETI(Blas2_syr2 syr2 + COMPONENTS blas + HEADER_LIST ETI_HEADERS + SOURCE_LIST SOURCES + TYPE_LISTS FLOATS LAYOUTS DEVICES +) + KOKKOSKERNELS_GENERATE_ETI(Blas3_gemm gemm COMPONENTS blas HEADER_LIST ETI_HEADERS diff --git a/blas/eti/generated_specializations_cpp/syr2/KokkosBlas2_syr2_eti_spec_inst.cpp.in b/blas/eti/generated_specializations_cpp/syr2/KokkosBlas2_syr2_eti_spec_inst.cpp.in new file mode 100644 index 0000000000..669b5fd1aa --- /dev/null +++ b/blas/eti/generated_specializations_cpp/syr2/KokkosBlas2_syr2_eti_spec_inst.cpp.in @@ -0,0 +1,25 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). 
+// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#define KOKKOSKERNELS_IMPL_COMPILE_LIBRARY true +#include "KokkosKernels_config.h" +#include "KokkosBlas2_syr2_spec.hpp" + +namespace KokkosBlas { +namespace Impl { +@BLAS2_SYR2_ETI_INST_BLOCK@ +} //IMPL +} //Kokkos diff --git a/sparse/tpls/KokkosSparse_spadd_tpl_spec_decl.hpp b/blas/eti/generated_specializations_hpp/KokkosBlas2_syr2_eti_spec_avail.hpp.in similarity index 76% rename from sparse/tpls/KokkosSparse_spadd_tpl_spec_decl.hpp rename to blas/eti/generated_specializations_hpp/KokkosBlas2_syr2_eti_spec_avail.hpp.in index 8f5ad83ed7..9e7a01653e 100644 --- a/sparse/tpls/KokkosSparse_spadd_tpl_spec_decl.hpp +++ b/blas/eti/generated_specializations_hpp/KokkosBlas2_syr2_eti_spec_avail.hpp.in @@ -14,11 +14,12 @@ // //@HEADER -#ifndef KOKKOSPARSE_SPADD_TPL_SPEC_DECL_HPP_ -#define KOKKOSPARSE_SPADD_TPL_SPEC_DECL_HPP_ - -namespace KokkosSparse { -namespace Impl {} -} // namespace KokkosSparse +#ifndef KOKKOSBLAS2_SYR2_ETI_SPEC_AVAIL_HPP_ +#define KOKKOSBLAS2_SYR2_ETI_SPEC_AVAIL_HPP_ +namespace KokkosBlas { +namespace Impl { +@BLAS2_SYR2_ETI_AVAIL_BLOCK@ +} //IMPL +} //Kokkos #endif diff --git a/blas/impl/KokkosBlas1_axpby_impl.hpp b/blas/impl/KokkosBlas1_axpby_impl.hpp index 4e468b0e56..b919d76a94 100644 --- a/blas/impl/KokkosBlas1_axpby_impl.hpp +++ b/blas/impl/KokkosBlas1_axpby_impl.hpp @@ -19,14 +19,23 @@ #include "KokkosKernels_config.h" #include "Kokkos_Core.hpp" #include "Kokkos_InnerProductSpaceTraits.hpp" - -#ifndef KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY -#define KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY 2 -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY +#include "KokkosKernels_Error.hpp" namespace KokkosBlas { namespace Impl { 
+template +constexpr typename std::enable_if, int>::type +axpbyVarExtent(T& v) { + return v.extent(0); +} + +template +constexpr typename std::enable_if, int>::type +axpbyVarExtent(T&) { + return 0; +} + // // axpby // @@ -44,8 +53,8 @@ namespace Impl { // // The template parameters scalar_x and scalar_y correspond to alpha // resp. beta in the operation y = alpha*x + beta*y. The values -1, -// 0, and -1 correspond to literal values of those coefficients. The -// value 2 tells the functor to use the corresponding vector of +// 0, and -1 correspond to literal values of those coefficients. +// The value 2 tells the functor to use the corresponding vector of // coefficients. Any literal coefficient of zero has BLAS semantics // of ignoring the corresponding (multi)vector entry. This does not // apply to coefficients in the a and b vectors, if they are used. @@ -61,32 +70,39 @@ struct Axpby_Functor { AV m_a; BV m_b; - Axpby_Functor(const XV& x, const YV& y, const AV& a, const BV& b, + Axpby_Functor(const XV& x, const YV& y, const AV& av, const BV& bv, const SizeType startingColumn) - : m_x(x), m_y(y), m_a(a), m_b(b) { + : m_x(x), m_y(y), m_a(av), m_b(bv) { static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_Functor: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_Functor(ABgeneric)" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_Functor: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_Functor(ABgeneric)" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_Functor: Y is const. 
" - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_Functor(ABgeneric)" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YV::rank == (int)XV::rank, - "KokkosBlas::Impl::" - "Axpby_Functor: X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_Functor(ABgeneric)" + ": X and Y must have the same rank."); static_assert(YV::rank == 1, - "KokkosBlas::Impl::Axpby_Functor: " - "XV and YV must have rank 1."); - + "KokkosBlas::Impl::Axpby_Functor(ABgeneric)" + ": XV and YV must have rank 1."); + static_assert((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2), + "KokkosBlas::Impl::Axpby_Functor(ABgeneric)" + ": scalar_x and/or scalar_y are out of range."); if (startingColumn != 0) { - m_a = Kokkos::subview( - a, std::make_pair(startingColumn, SizeType(a.extent(0)))); - m_b = Kokkos::subview( - b, std::make_pair(startingColumn, SizeType(b.extent(0)))); + if (axpbyVarExtent(m_a) > 1) { + m_a = Kokkos::subview( + av, std::make_pair(startingColumn, SizeType(av.extent(0)))); + } + if (axpbyVarExtent(m_b) > 1) { + m_b = Kokkos::subview( + bv, std::make_pair(startingColumn, SizeType(bv.extent(0)))); + } } } @@ -96,73 +112,83 @@ struct Axpby_Functor { // are template parameters), so the compiler should evaluate these // branches at compile time. 
-#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY <= 2 - - if (scalar_x == 0 && scalar_y == 0) { - m_y(i) = ATS::zero(); - } - if (scalar_x == 0 && scalar_y == 2) { - m_y(i) = m_b(0) * m_y(i); - } - if (scalar_x == 2 && scalar_y == 0) { - m_y(i) = m_a(0) * m_x(i); - } - if (scalar_x == 2 && scalar_y == 2) { - m_y(i) = m_a(0) * m_x(i) + m_b(0) * m_y(i); - } - -#else // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - if (scalar_x == 0 && scalar_y == 0) { - m_y(i) = ATS::zero(); - } - if (scalar_x == 0 && scalar_y == -1) { - m_y(i) = -m_y(i); - } - if (scalar_x == 0 && scalar_y == 1) { - return; // m_y(i) = m_y(i); - } - if (scalar_x == 0 && scalar_y == 2) { - m_y(i) = m_b(0) * m_y(i); - } - if (scalar_x == -1 && scalar_y == 0) { - m_y(i) = -m_x(i); - } - if (scalar_x == -1 && scalar_y == -1) { - m_y(i) = -m_x(i) - m_y(i); - } - if (scalar_x == -1 && scalar_y == 1) { - m_y(i) = -m_x(i) + m_y(i); - } - if (scalar_x == -1 && scalar_y == 2) { - m_y(i) = -m_x(i) + m_b(0) * m_y(i); + // ************************************************************** + // Possibilities with 'scalar_x == 0' + // ************************************************************** + if constexpr (scalar_x == 0) { + if constexpr (scalar_y == 0) { + m_y(i) = ATS::zero(); + } else if constexpr (scalar_y == -1) { + m_y(i) = -m_y(i); + } else if constexpr (scalar_y == 1) { + // Nothing to do: m_y(i) = m_y(i); + } else if constexpr (scalar_y == 2) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { + m_y(i) = + Kokkos::ArithTraits::zero(); + } else { + m_y(i) = m_b(0) * m_y(i); + } + } + } + // ************************************************************** + // Possibilities with 'scalar_x == -1' + // ************************************************************** + else if constexpr (scalar_x == -1) { + if constexpr (scalar_y == 0) { + m_y(i) = -m_x(i); + } else if constexpr (scalar_y == -1) { + m_y(i) = -m_x(i) - m_y(i); + } else if constexpr (scalar_y == 1) { + m_y(i) = -m_x(i) + m_y(i); + } else if constexpr 
(scalar_y == 2) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { + m_y(i) = -m_x(i); + } else { + m_y(i) = -m_x(i) + m_b(0) * m_y(i); + } + } + } + // ************************************************************** + // Possibilities with 'scalar_x == 1' + // ************************************************************** + else if constexpr (scalar_x == 1) { + if constexpr (scalar_y == 0) { + m_y(i) = m_x(i); + } else if constexpr (scalar_y == -1) { + m_y(i) = m_x(i) - m_y(i); + } else if constexpr (scalar_y == 1) { + m_y(i) = m_x(i) + m_y(i); + } else if constexpr (scalar_y == 2) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { + m_y(i) = m_x(i); + } else { + m_y(i) = m_x(i) + m_b(0) * m_y(i); + } + } + } + // ************************************************************** + // Possibilities with 'scalar_x == 2' + // ************************************************************** + else if constexpr (scalar_x == 2) { + if constexpr (scalar_y == 0) { + m_y(i) = m_a(0) * m_x(i); + } else if constexpr (scalar_y == -1) { + m_y(i) = m_a(0) * m_x(i) - m_y(i); + } else if constexpr (scalar_y == 1) { + m_y(i) = m_a(0) * m_x(i) + m_y(i); + } else if constexpr (scalar_y == 2) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { + m_y(i) = m_a(0) * m_x(i); + } else { + m_y(i) = m_a(0) * m_x(i) + m_b(0) * m_y(i); + } + } } - if (scalar_x == 1 && scalar_y == 0) { - m_y(i) = m_x(i); - } - if (scalar_x == 1 && scalar_y == -1) { - m_y(i) = m_x(i) - m_y(i); - } - if (scalar_x == 1 && scalar_y == 1) { - m_y(i) = m_x(i) + m_y(i); - } - if (scalar_x == 1 && scalar_y == 2) { - m_y(i) = m_x(i) + m_b(0) * m_y(i); - } - if (scalar_x == 2 && scalar_y == 0) { - m_y(i) = m_a(0) * m_x(i); - } - if (scalar_x == 2 && scalar_y == -1) { - m_y(i) = m_a(0) * m_x(i) - m_y(i); - } - if (scalar_x == 2 && scalar_y == 1) { - m_y(i) = m_a(0) * m_x(i) + m_y(i); - } - if (scalar_x == 2 && scalar_y == 2) { - m_y(i) = m_a(0) * m_x(i) + m_b(0) * m_y(i); - } - -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY } 
}; @@ -177,8 +203,8 @@ struct Axpby_Functor { // // The template parameters scalar_x and scalar_y correspond to alpha // resp. beta in the operation y = alpha*x + beta*y. The values -1, -// 0, and -1 correspond to literal values of those coefficients. The -// value 2 tells the functor to use the corresponding vector of +// 0, and -1 correspond to literal values of those coefficients. +// The value 2 tells the functor to use the corresponding vector of // coefficients. Any literal coefficient of zero has BLAS semantics // of ignoring the corresponding (multi)vector entry. This does not // apply to coefficients in the a and b vectors, if they are used. @@ -201,22 +227,26 @@ struct Axpby_Functor::value, - "KokkosBlas::Impl::" - "Axpby_Functor: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_Functor(ABscalars)" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_Functor: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_Functor(ABscalars)" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_Functor: R is const. 
" - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_Functor(ABscalars)" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YV::rank == (int)XV::rank, - "KokkosBlas::Impl::" - "Axpby_Functor: X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_Functor(ABscalars)" + ": X and Y must have the same rank."); static_assert(YV::rank == 1, - "KokkosBlas::Impl::Axpby_Functor: " + "KokkosBlas::Impl::Axpby_Functor(ABscalars)" "XV and YV must have rank 1."); + static_assert((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2), + "KokkosBlas::Impl::Axpby_Functor(ABscalars)" + ": scalar_x and/or scalar_y are out of range."); } KOKKOS_INLINE_FUNCTION @@ -225,80 +255,69 @@ struct Axpby_Functor(ATS::zero()); - } - if (scalar_x == 0 && scalar_y == 2) { - m_y(i) = static_cast(m_b * m_y(i)); - } - if (scalar_x == 2 && scalar_y == 0) { - m_y(i) = static_cast(m_a * m_x(i)); - } - if (scalar_x == 2 && scalar_y == 2) { - m_y(i) = static_cast(m_a * m_x(i) + - m_b * m_y(i)); - } - -#else // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - if (scalar_x == 0 && scalar_y == 0) { - m_y(i) = ATS::zero(); - } - if (scalar_x == 0 && scalar_y == -1) { - m_y(i) = -m_y(i); - } - if (scalar_x == 0 && scalar_y == 1) { - return; // m_y(i) = m_y(i); - } - if (scalar_x == 0 && scalar_y == 2) { - m_y(i) = m_b * m_y(i); - } - if (scalar_x == -1 && scalar_y == 0) { - m_y(i) = -m_x(i); - } - if (scalar_x == -1 && scalar_y == -1) { - m_y(i) = -m_x(i) - m_y(i); + // ************************************************************** + // Possibilities with 'scalar_x == 0' + // ************************************************************** + if constexpr (scalar_x == 0) { + if constexpr (scalar_y == 0) { + m_y(i) = ATS::zero(); + } else if constexpr (scalar_y == -1) { + m_y(i) = -m_y(i); + } else if constexpr (scalar_y == 
1) { + // Nothing to do: m_y(i) = m_y(i); + } else if constexpr (scalar_y == 2) { + m_y(i) = m_b * m_y(i); + } + } + // ************************************************************** + // Possibilities with 'scalar_x == -1' + // ************************************************************** + else if constexpr (scalar_x == -1) { + if constexpr (scalar_y == 0) { + m_y(i) = -m_x(i); + } else if constexpr (scalar_y == -1) { + m_y(i) = -m_x(i) - m_y(i); + } else if constexpr (scalar_y == 1) { + m_y(i) = -m_x(i) + m_y(i); + } else if constexpr (scalar_y == 2) { + m_y(i) = -m_x(i) + m_b * m_y(i); + } + } + // ************************************************************** + // Possibilities with 'scalar_x == 1' + // ************************************************************** + else if constexpr (scalar_x == 1) { + if constexpr (scalar_y == 0) { + m_y(i) = m_x(i); + } else if constexpr (scalar_y == -1) { + m_y(i) = m_x(i) - m_y(i); + } else if constexpr (scalar_y == 1) { + m_y(i) = m_x(i) + m_y(i); + } else if constexpr (scalar_y == 2) { + m_y(i) = m_x(i) + m_b * m_y(i); + } + } + // ************************************************************** + // Possibilities with 'scalar_x == 2' + // ************************************************************** + else if constexpr (scalar_x == 2) { + if constexpr (scalar_y == 0) { + m_y(i) = m_a * m_x(i); + } else if constexpr (scalar_y == -1) { + m_y(i) = m_a * m_x(i) - m_y(i); + } else if constexpr (scalar_y == 1) { + m_y(i) = m_a * m_x(i) + m_y(i); + } else if constexpr (scalar_y == 2) { + m_y(i) = m_a * m_x(i) + m_b * m_y(i); + } } - if (scalar_x == -1 && scalar_y == 1) { - m_y(i) = -m_x(i) + m_y(i); - } - if (scalar_x == -1 && scalar_y == 2) { - m_y(i) = -m_x(i) + m_b * m_y(i); - } - if (scalar_x == 1 && scalar_y == 0) { - m_y(i) = m_x(i); - } - if (scalar_x == 1 && scalar_y == -1) { - m_y(i) = m_x(i) - m_y(i); - } - if (scalar_x == 1 && scalar_y == 1) { - m_y(i) = m_x(i) + m_y(i); - } - if (scalar_x == 1 && scalar_y == 2) { 
- m_y(i) = m_x(i) + m_b * m_y(i); - } - if (scalar_x == 2 && scalar_y == 0) { - m_y(i) = m_a * m_x(i); - } - if (scalar_x == 2 && scalar_y == -1) { - m_y(i) = m_a * m_x(i) - m_y(i); - } - if (scalar_x == 2 && scalar_y == 1) { - m_y(i) = m_a * m_x(i) + m_y(i); - } - if (scalar_x == 2 && scalar_y == 2) { - m_y(i) = m_a * m_x(i) + m_b * m_y(i); - } - -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY } }; // Variant of Axpby_MV_Generic for single vectors (1-D Views) x and y. -// As above, either av and bv are both 1-D Views (and only the first -// entry of each will be read), or both av and bv are scalars. +// As above, av and bv are either: +// - both 1-D views (and only the first entry of each are read), or +// - both scalars. // // This takes the starting column, so that if av and bv are both 1-D // Views, then the functor can take a subview if appropriate. @@ -306,7 +325,7 @@ template void Axpby_Generic(const execution_space& space, const AV& av, const XV& x, const BV& bv, const YV& y, const SizeType startingColumn, - int a = 2, int b = 2) { + int scalar_x = 2, int scalar_y = 2) { static_assert(Kokkos::is_view::value, "KokkosBlas::Impl::" "Axpby_Generic: X is not a Kokkos::View."); @@ -325,118 +344,106 @@ void Axpby_Generic(const execution_space& space, const AV& av, const XV& x, "KokkosBlas::Impl::Axpby_Generic: " "XV and YV must have rank 1."); - const SizeType numRows = x.extent(0); - Kokkos::RangePolicy policy(space, 0, numRows); - - if (a == 0 && b == 0) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S0", policy, op); - return; - } - -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - if (a == 0 && b == -1) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S1", policy, op); - return; + if ((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2)) { + // Ok + } else { + KokkosKernels::Impl::throw_runtime_exception( + "KokkosBlas::Impl::Axpby_Generic()" + 
": scalar_x and/or scalar_y are out of range."); } - if (a == 0 && b == 1) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S2", policy, op); - return; - } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - if (a == 0 && b == 2) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S3", policy, op); - return; - } + const SizeType numRows = x.extent(0); + Kokkos::RangePolicy policy(space, 0, numRows); -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - // a == -1 - if (a == -1 && b == 0) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S4", policy, op); - return; - } - if (a == -1 && b == -1) { - Axpby_Functor op(x, y, av, bv, + // **************************************************************** + // Possibilities with 'scalar_x == 0' + // **************************************************************** + if (scalar_x == 0) { + if (scalar_y == 0) { + Axpby_Functor op(x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S5", policy, op); - return; - } - if (a == -1 && b == 1) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S6", policy, op); - return; - } - if (a == -1 && b == 2) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S7", policy, op); - return; - } - // a == 1 - if (a == 1 && b == 0) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S8", policy, op); - return; - } - if (a == 1 && b == -1) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S9", policy, op); - return; - } - if (a == 1 && b == 1) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S10", policy, op); - return; - } - if (a == 1 && b == 2) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - 
Kokkos::parallel_for("KokkosBlas::Axpby::S11", policy, op); - return; + Kokkos::parallel_for("KokkosBlas::Axpby::S0", policy, op); + } else if (scalar_y == -1) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S1", policy, op); + } else if (scalar_y == 1) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S2", policy, op); + } else if (scalar_y == 2) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S3", policy, op); + } } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - // a == 2 - if (a == 2 && b == 0) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S12", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == -1' + // **************************************************************** + else if (scalar_x == -1) { + if (scalar_y == 0) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S4", policy, op); + } else if (scalar_y == -1) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S5", policy, op); + } else if (scalar_y == 1) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S6", policy, op); + } else if (scalar_y == 2) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S7", policy, op); + } } - -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - if (a == 2 && b == -1) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S13", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == 1' + // **************************************************************** + else if (scalar_x == 1) { + if (scalar_y == 0) { + 
Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S8", policy, op); + } else if (scalar_y == -1) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S9", policy, op); + } else if (scalar_y == 1) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S10", policy, op); + } else if (scalar_y == 2) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S11", policy, op); + } } - if (a == 2 && b == 1) { - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S14", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == 2' + // **************************************************************** + else if (scalar_x == 2) { + if (scalar_y == 0) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S12", policy, op); + } else if (scalar_y == -1) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S13", policy, op); + } else if (scalar_y == 1) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S14", policy, op); + } else if (scalar_y == 2) { + Axpby_Functor op(x, y, av, bv, + startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::S15", policy, op); + } } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - // a and b arbitrary (not -1, 0, or 1) - Axpby_Functor op(x, y, av, bv, - startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::S15", policy, op); } } // namespace Impl diff --git a/blas/impl/KokkosBlas1_axpby_mv_impl.hpp b/blas/impl/KokkosBlas1_axpby_mv_impl.hpp index 32653b9cce..7db7b0abe3 100644 --- a/blas/impl/KokkosBlas1_axpby_mv_impl.hpp +++ b/blas/impl/KokkosBlas1_axpby_mv_impl.hpp @@ -35,8 +35,8 @@ namespace Impl { // // The template 
parameters scalar_x and scalar_y correspond to alpha // resp. beta in the operation y = alpha*x + beta*y. The values -1, -// 0, and -1 correspond to literal values of those coefficients. The -// value 2 tells the functor to use the corresponding vector of +// 0, and -1 correspond to literal values of those coefficients. +// The value 2 tells the functor to use the corresponding vector of // coefficients. Any literal coefficient of zero has BLAS semantics // of ignoring the corresponding (multi)vector entry. This does not // apply to coefficients in the a and b vectors, if they are used. @@ -52,39 +52,41 @@ struct Axpby_MV_Functor { AV m_a; BV m_b; - Axpby_MV_Functor(const XMV& X, const YMV& Y, const AV& a, const BV& b) - : numCols(X.extent(1)), m_x(X), m_y(Y), m_a(a), m_b(b) { - // XMV and YMV must be Kokkos::View specializations. + Axpby_MV_Functor(const XMV& X, const YMV& Y, const AV& av, const BV& bv) + : numCols(X.extent(1)), m_x(X), m_y(Y), m_a(av), m_b(bv) { static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Functor: a is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": 'a' is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Functor: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Functor: b is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": 'b' is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Functor: Y is not a Kokkos::View."); - // YMV must be nonconst (else it can't be an output argument). + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_MV_Functor: Y is const. 
" - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YMV::rank == (int)XMV::rank, - "KokkosBlas::Impl::Axpby_MV_Functor: " - "X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": X and Y must have the same rank."); static_assert(YMV::rank == 2, - "KokkosBlas::Impl::Axpby_MV_Functor: " - "XMV and YMV must have rank 2."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": XMV and YMV must have rank 2."); static_assert(AV::rank == 1, - "KokkosBlas::Impl::Axpby_MV_Functor: " - "AV must have rank 1."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": AV must have rank 1."); static_assert(BV::rank == 1, - "KokkosBlas::Impl::Axpby_MV_Functor: " - "BV must have rank 1."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": BV must have rank 1."); + static_assert((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2), + "KokkosBlas::Impl::Axpby_MV_Functor(ABgeneric)" + ": scalar_x and/or scalar_y are out of range."); } KOKKOS_INLINE_FUNCTION @@ -92,175 +94,358 @@ struct Axpby_MV_Functor { // scalar_x and scalar_y are compile-time constants (since they // are template parameters), so the compiler should evaluate these // branches at compile time. 
- if (scalar_x == 0 && scalar_y == 0) { + + // ************************************************************** + // Possibilities with 'scalar_x == 0' + // ************************************************************** + if constexpr (scalar_x == 0) { + if constexpr (scalar_y == 0) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = ATS::zero(); - } - } - if (scalar_x == 0 && scalar_y == -1) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = ATS::zero(); + } + } else if constexpr (scalar_y == -1) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = -m_y(i, k); - } - } - if (scalar_x == 0 && scalar_y == 1) { - return; // Y(i,j) := Y(i,j) - } - if (scalar_x == 0 && scalar_y == 2) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = -m_y(i, k); + } + } else if constexpr (scalar_y == 1) { + // Nothing to do: Y(i,j) := Y(i,j) + } else if constexpr (scalar_y == 2) { + if (m_b.extent(0) == 1) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_b(k) * m_y(i, k); - } - } - if (scalar_x == -1 && scalar_y == 0) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = Kokkos::ArithTraits< + typename YMV::non_const_value_type>::zero(); + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = -m_x(i, k); - } - } - if (scalar_x == -1 && scalar_y == -1) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_b(0) * m_y(i, k); + } + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif 
#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = -m_x(i, k) - m_y(i, k); + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_b(k) * m_y(i, k); + } + } } } - if (scalar_x == -1 && scalar_y == 1) { + // ************************************************************** + // Possibilities with 'scalar_x == -1' + // ************************************************************** + else if constexpr (scalar_x == -1) { + if constexpr (scalar_y == 0) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = -m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == -1 && scalar_y == 2) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = -m_x(i, k); + } + } else if constexpr (scalar_y == -1) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = -m_x(i, k) + m_b(k) * m_y(i, k); - } - } - if (scalar_x == 1 && scalar_y == 0) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = -m_x(i, k) - m_y(i, k); + } + } else if constexpr (scalar_y == 1) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_x(i, k); - } - } - if (scalar_x == 1 && scalar_y == -1) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = -m_x(i, k) + m_y(i, k); + } + } else if constexpr (scalar_y == 2) { + if (m_b.extent(0) == 1) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_x(i, k) - m_y(i, k); - } - } - if (scalar_x == 1 && scalar_y == 1) { + for (size_type k = 0; k < numCols; 
++k) { + m_y(i, k) = -m_x(i, k); + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == 1 && scalar_y == 2) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = -m_x(i, k) + m_b(0) * m_y(i, k); + } + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_x(i, k) + m_b(k) * m_y(i, k); + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = -m_x(i, k) + m_b(k) * m_y(i, k); + } + } } } - if (scalar_x == 2 && scalar_y == 0) { + // ************************************************************** + // Possibilities with 'scalar_x == 1' + // ************************************************************** + else if constexpr (scalar_x == 1) { + if constexpr (scalar_y == 0) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k); - } - } - if (scalar_x == 2 && scalar_y == -1) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_x(i, k); + } + } else if constexpr (scalar_y == -1) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k) - m_y(i, k); - } - } - if (scalar_x == 2 && scalar_y == 1) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_x(i, k) - m_y(i, k); + } + } else if constexpr (scalar_y == 1) { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == 2 && scalar_y == 
2) { + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_x(i, k) + m_y(i, k); + } + } else if constexpr (scalar_y == 2) { + if (m_b.extent(0) == 1) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_x(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_x(i, k) + m_b(0) * m_y(i, k); + } + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_IVDEP #pragma ivdep #endif #ifdef KOKKOS_ENABLE_PRAGMA_VECTOR #pragma vector always #endif - for (size_type k = 0; k < numCols; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k) + m_b(k) * m_y(i, k); + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_x(i, k) + m_b(k) * m_y(i, k); + } + } } } - } + // ************************************************************** + // Possibilities with 'scalar_x == 2' + // ************************************************************** + else if constexpr (scalar_x == 2) { + if constexpr (scalar_y == 0) { + if (m_a.extent(0) == 1) { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k); + } + } + } else if constexpr (scalar_y == -1) { + if (m_a.extent(0) == 1) { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k) - m_y(i, k); + } + } else { +#ifdef 
KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k) - m_y(i, k); + } + } + } else if constexpr (scalar_y == 1) { + if (m_a.extent(0) == 1) { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k) + m_y(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k) + m_y(i, k); + } + } + } else if constexpr (scalar_y == 2) { + if (m_a.extent(0) == 1) { + if (m_b.extent(0) == 1) { + if (m_b(0) == Kokkos::ArithTraits< + typename BV::non_const_value_type>::zero()) { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k) + m_b(0) * m_y(i, k); + } + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k) + m_b(k) * m_y(i, k); + } + } + } else { + if (m_b.extent(0) == 1) { + if (m_b(0) == Kokkos::ArithTraits< + typename BV::non_const_value_type>::zero()) { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma 
ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k) + m_b(0) * m_y(i, k); + } + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP +#pragma ivdep +#endif +#ifdef KOKKOS_ENABLE_PRAGMA_VECTOR +#pragma vector always +#endif + for (size_type k = 0; k < numCols; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k) + m_b(k) * m_y(i, k); + } + } + } + } // if constexpr (scalar_y == ...) else if + } // if constexpr (scalar_x == ...) else if + } // void operator() }; // Variant of Axpby_MV_Functor, where a and b are scalars. @@ -268,13 +453,13 @@ struct Axpby_MV_Functor { // // 1. Y(i,j) = alpha*X(i,j) + beta*Y(i,j) for alpha,beta in -1,0,1 // 2. Y(i,j) = a*X(i,j) + beta*Y(i,j) for beta in -1,0,1 -// 3. Y(i,j) = alpha*X(i,j) + beta*Y(i,j) for alpha in -1,0,1 +// 3. Y(i,j) = alpha*X(i,j) + b*Y(i,j) for alpha in -1,0,1 // 4. Y(i,j) = a*X(i,j) + b*Y(i,j) // // The template parameters scalar_x and scalar_y correspond to alpha // resp. beta in the operation y = alpha*x + beta*y. The values -1, -// 0, and -1 correspond to literal values of those coefficients. The -// value 2 tells the functor to use the corresponding vector of +// 0, and -1 correspond to literal values of those coefficients. +// The value 2 tells the functor to use the corresponding vector of // coefficients. Any literal coefficient of zero has BLAS semantics // of ignoring the corresponding (multi)vector entry. This does not // apply to coefficients in the a and b vectors, if they are used. 
@@ -299,22 +484,26 @@ struct Axpby_MV_Functor::value, - "KokkosBlas::Impl::" - "Axpby_MV_Functor: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABscalars)" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Functor: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABscalars)" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_MV_Functor: Y is const. " - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABscalars)" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YMV::rank == (int)XMV::rank, - "KokkosBlas::Impl::" - "Axpby_MV_Functor: X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABscalars)" + ": X and Y must have the same rank."); static_assert(YMV::rank == 2, - "KokkosBlas::Impl::Axpby_MV_Functor: " - "XMV and YMV must have rank 2."); + "KokkosBlas::Impl::Axpby_MV_Functor(ABscalars)" + ": XMV and YMV must have rank 2."); + static_assert((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2), + "KokkosBlas::Impl::Axpby_MV_Functor(ABscalars)" + ": scalar_x and/or scalar_y are out of range."); } KOKKOS_INLINE_FUNCTION @@ -322,175 +511,184 @@ struct Axpby_MV_Functor::value, - "KokkosBlas::Impl::" - "Axpby_MV_Unroll_Functor: a is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": 'a' is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Unroll_Functor: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Unroll_Functor: b is not a Kokkos::View."); + 
"KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": 'b' is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Unroll_Functor: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_MV_Unroll_Functor: Y is const. " - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YMV::rank == (int)XMV::rank, - "KokkosBlas::Impl::Axpby_MV_Unroll_Functor: " - "X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": X and Y must have the same rank."); static_assert(YMV::rank == 2, - "KokkosBlas::Impl::Axpby_MV_Unroll_Functor: " - "XMV and YMV must have rank 2."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": XMV and YMV must have rank 2."); static_assert(AV::rank == 1, - "KokkosBlas::Impl::Axpby_MV_Unroll_Functor: " - "AV must have rank 1."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": AV must have rank 1."); static_assert(BV::rank == 1, - "KokkosBlas::Impl::Axpby_MV_Unroll_Functor: " - "BV must have rank 1."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": BV must have rank 1."); + static_assert((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2), + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABgeneric)" + ": scalar_x and/or scalar_y are out of range."); if (startingColumn != 0) { - m_a = Kokkos::subview( - a, std::make_pair(startingColumn, SizeType(a.extent(0)))); - m_b = Kokkos::subview( - b, std::make_pair(startingColumn, SizeType(b.extent(0)))); + if (axpbyVarExtent(m_a) > 1) { + m_a = Kokkos::subview( + av, std::make_pair(startingColumn, 
SizeType(av.extent(0)))); + } + if (axpbyVarExtent(m_b) > 1) { + m_b = Kokkos::subview( + bv, std::make_pair(startingColumn, SizeType(bv.extent(0)))); + } } } @@ -553,167 +759,269 @@ struct Axpby_MV_Unroll_Functor { // are template parameters), so the compiler should evaluate these // branches at compile time. -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY <= 2 - - if (scalar_x == 0 && scalar_y == 0) { + // ************************************************************** + // Possibilities with 'scalar_x == 0' + // ************************************************************** + if constexpr (scalar_x == 0) { + if constexpr (scalar_y == 0) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = ATS::zero(); - } - } - if (scalar_x == 0 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = ATS::zero(); + } + } else if constexpr (scalar_y == -1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_b(k) * m_y(i, k); - } - } - if (scalar_x == 2 && scalar_y == 0) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_y(i, k); + } + } else if constexpr (scalar_y == 1) { + // Nothing to do: Y(i,j) := Y(i,j) + } else if constexpr (scalar_y == 2) { + if (m_b.extent(0) == 1) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k); - } - } - if (scalar_x == 2 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = Kokkos::ArithTraits< + typename YMV::non_const_value_type>::zero(); + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k) + m_b(k) * m_y(i, k); - } - } - -#else // KOKKOSBLAS_OPTIMIZATION_LEVEL >= 3 - - if (scalar_x == 0 && scalar_y == 0) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_b(0) * m_y(i, k); + } + } + } 
else { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = ATS::zero(); + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_b(k) * m_y(i, k); + } + } } } - if (scalar_x == 0 && scalar_y == -1) { + // ************************************************************** + // Possibilities with 'scalar_x == -1' + // ************************************************************** + else if constexpr (scalar_x == -1) { + if constexpr (scalar_y == 0) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_y(i, k); - } - } - if (scalar_x == 0 && scalar_y == 1) { - return; // Y(i,j) := Y(i,j) - } - if (scalar_x == 0 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k); + } + } else if constexpr (scalar_y == -1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_b(k) * m_y(i, k); - } - } - if (scalar_x == -1 && scalar_y == 0) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k) - m_y(i, k); + } + } else if constexpr (scalar_y == 1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_x(i, k); - } - } - if (scalar_x == -1 && scalar_y == -1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k) + m_y(i, k); + } + } else if constexpr (scalar_y == 2) { + if (m_b.extent(0) == 1) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_x(i, k) - m_y(i, k); - } - } - if (scalar_x == -1 && scalar_y == 1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k); + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == -1 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, 
k) = -m_x(i, k) + m_b(0) * m_y(i, k); + } + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_x(i, k) + m_b(k) * m_y(i, k); + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k) + m_b(k) * m_y(i, k); + } + } } } - if (scalar_x == 1 && scalar_y == 0) { + // ************************************************************** + // Possibilities with 'scalar_x == 1' + // ************************************************************** + else if constexpr (scalar_x == 1) { + if constexpr (scalar_y == 0) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_x(i, k); - } - } - if (scalar_x == 1 && scalar_y == -1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k); + } + } else if constexpr (scalar_y == -1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_x(i, k) - m_y(i, k); - } - } - if (scalar_x == 1 && scalar_y == 1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k) - m_y(i, k); + } + } else if constexpr (scalar_y == 1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == 1 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k) + m_y(i, k); + } + } else if constexpr (scalar_y == 2) { + if (m_b.extent(0) == 1) { + if (m_b(0) == + Kokkos::ArithTraits::zero()) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_x(i, k) + m_b(k) * m_y(i, k); - } - } - if (scalar_x == 2 && scalar_y == 0) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k); + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k); - } - } - if (scalar_x == 2 && scalar_y == -1) { + for (int k = 0; k < UNROLL; 
++k) { + m_y(i, k) = m_x(i, k) + m_b(0) * m_y(i, k); + } + } + } else { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k) - m_y(i, k); + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k) + m_b(k) * m_y(i, k); + } + } } } - if (scalar_x == 2 && scalar_y == 1) { + // ************************************************************** + // Possibilities with 'scalar_x == 2' + // ************************************************************** + else if constexpr (scalar_x == 2) { + if constexpr (scalar_y == 0) { + if (m_a.extent(0) == 1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == 2 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k); + } + } + } else if constexpr (scalar_y == -1) { + if (m_a.extent(0) == 1) { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k) - m_y(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k) - m_y(i, k); + } + } + } else if constexpr (scalar_y == 1) { + if (m_a.extent(0) == 1) { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k) + m_y(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k) + m_y(i, k); + } + } + } else if constexpr (scalar_y == 2) { + if (m_a.extent(0) == 1) { + if (m_b.extent(0) == 1) { + if (m_b(0) == Kokkos::ArithTraits< + typename BV::non_const_value_type>::zero()) { +#ifdef 
KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k) + m_b(0) * m_y(i, k); + } + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(0) * m_x(i, k) + m_b(k) * m_y(i, k); + } + } + } else { + if (m_b.extent(0) == 1) { + if (m_b(0) == Kokkos::ArithTraits< + typename BV::non_const_value_type>::zero()) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a(k) * m_x(i, k) + m_b(k) * m_y(i, k); + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k); + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k) + m_b(0) * m_y(i, k); + } + } + } else { +#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL +#pragma unroll +#endif + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a(k) * m_x(i, k) + m_b(k) * m_y(i, k); + } + } + } } } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY } }; @@ -739,22 +1047,26 @@ struct Axpby_MV_Unroll_Functor::value, - "KokkosBlas::Impl::" - "Axpby_MV_Unroll_Functor: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABscalars)" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Unroll_Functor: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABscalars)" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_MV_Unroll_Functor: Y is const. 
" - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABscalars)" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YMV::rank == (int)XMV::rank, - "KokkosBlas::Impl::" - "Axpby_MV_Unroll_Functor: X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABscalars)" + ": X and Y must have the same rank."); static_assert(YMV::rank == 2, - "KokkosBlas::Impl::Axpby_MV_Unroll_Functor: " - "XMV and YMV must have rank 2."); + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABscalars)" + ": XMV and YMV must have rank 2."); + static_assert((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2), + "KokkosBlas::Impl::Axpby_MV_Unroll_Functor(ABscalars)" + ": scalar_x and/or scalar_y are out of range."); } KOKKOS_INLINE_FUNCTION @@ -763,168 +1075,137 @@ struct Axpby_MV_Unroll_Functor 2 - - if (scalar_x == 0 && scalar_y == 0) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k); + } + } else if constexpr (scalar_y == -1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = ATS::zero(); - } - } - if (scalar_x == 0 && scalar_y == -1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k) - m_y(i, k); + } + } else if constexpr (scalar_y == 1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_y(i, k); - } - } - if (scalar_x == 0 && scalar_y == 1) { - return; // Y(i,j) := Y(i,j) - } - if (scalar_x == 0 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k) + m_y(i, k); + } + } else if constexpr (scalar_y == 2) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_b * m_y(i, k); + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = -m_x(i, k) + 
m_b * m_y(i, k); + } } } - if (scalar_x == -1 && scalar_y == 0) { + // ************************************************************** + // Possibilities with 'scalar_x == 1' + // ************************************************************** + else if constexpr (scalar_x == 1) { + if constexpr (scalar_y == 0) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_x(i, k); - } - } - if (scalar_x == -1 && scalar_y == -1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k); + } + } else if constexpr (scalar_y == -1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_x(i, k) - m_y(i, k); - } - } - if (scalar_x == -1 && scalar_y == 1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k) - m_y(i, k); + } + } else if constexpr (scalar_y == 1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == -1 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k) + m_y(i, k); + } + } else if constexpr (scalar_y == 2) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = -m_x(i, k) + m_b * m_y(i, k); + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_x(i, k) + m_b * m_y(i, k); + } } } - if (scalar_x == 1 && scalar_y == 0) { + // ************************************************************** + // Possibilities with 'scalar_x == 2' + // ************************************************************** + else if constexpr (scalar_x == 2) { + if constexpr (scalar_y == 0) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_x(i, k); - } - } - if (scalar_x == 1 && scalar_y == -1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a * m_x(i, k); + } + } else if constexpr (scalar_y == -1) { #ifdef 
KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_x(i, k) - m_y(i, k); - } - } - if (scalar_x == 1 && scalar_y == 1) { -#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL -#pragma unroll -#endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == 1 && scalar_y == 2) { -#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL -#pragma unroll -#endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_x(i, k) + m_b * m_y(i, k); - } - } - if (scalar_x == 2 && scalar_y == 0) { -#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL -#pragma unroll -#endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a * m_x(i, k); - } - } - if (scalar_x == 2 && scalar_y == -1) { -#ifdef KOKKOS_ENABLE_PRAGMA_UNROLL -#pragma unroll -#endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a * m_x(i, k) - m_y(i, k); - } - } - if (scalar_x == 2 && scalar_y == 1) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a * m_x(i, k) - m_y(i, k); + } + } else if constexpr (scalar_y == 1) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a * m_x(i, k) + m_y(i, k); - } - } - if (scalar_x == 2 && scalar_y == 2) { + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a * m_x(i, k) + m_y(i, k); + } + } else if constexpr (scalar_y == 2) { #ifdef KOKKOS_ENABLE_PRAGMA_UNROLL #pragma unroll #endif - for (int k = 0; k < UNROLL; ++k) { - m_y(i, k) = m_a * m_x(i, k) + m_b * m_y(i, k); + for (int k = 0; k < UNROLL; ++k) { + m_y(i, k) = m_a * m_x(i, k) + m_b * m_y(i, k); + } } } - -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY } }; @@ -936,11 +1217,11 @@ struct Axpby_MV_Unroll_Functor void Axpby_MV_Unrolled(const execution_space& space, const AV& av, const XMV& x, const BV& bv, const YMV& y, - const SizeType startingColumn, int a = 2, int b = 2) { + const SizeType startingColumn, int scalar_x = 2, + int scalar_y = 2) { static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - 
"Axpby_MV_Unrolled: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Unrolled()" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Unrolled: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Unrolled()" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_MV_Unrolled: Y is const. " - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_MV_Unrolled()" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YMV::rank == (int)XMV::rank, - "KokkosBlas::Impl::" - "Axpby_MV_Unrolled: X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_MV_Unrolled()" + ": X and Y must have the same rank."); static_assert(YMV::rank == 2, - "KokkosBlas::Impl::Axpby_MV_Unrolled: " - "XMV and YMV must have rank 2."); + "KokkosBlas::Impl::Axpby_MV_Unrolled()" + ": XMV and YMV must have rank 2."); + if ((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2)) { + // Ok + } else { + KokkosKernels::Impl::throw_runtime_exception( + "KokkosBlas::Impl::Axpby_MV_Unrolled()" + ": scalar_x and/or scalar_y are out of range."); + } const SizeType numRows = x.extent(0); Kokkos::RangePolicy policy(space, 0, numRows); - if (a == 0 && b == 0) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S0", policy, op); - return; - } - -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - if (a == 0 && b == -1) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S1", policy, op); - return; - } - if (a == 0 && b == 1) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S2", policy, op); - return; - } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY - 
- if (a == 0 && b == 2) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S3", policy, op); - return; - } - -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - // a == -1 - if (a == -1 && b == 0) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S4", policy, op); - return; - } - if (a == -1 && b == -1) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S5", policy, op); - return; - } - if (a == -1 && b == 1) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S6", policy, op); - return; - } - if (a == -1 && b == 2) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S7", policy, op); - return; - } - // a == 1 - if (a == 1 && b == 0) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S8", policy, op); - return; - } - if (a == 1 && b == -1) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S9", policy, op); - return; - } - if (a == 1 && b == 1) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S10", policy, op); - return; - } - if (a == 1 && b == 2) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S11", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == 0' + // **************************************************************** + if (scalar_x == 0) { + if (scalar_y == 0) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S0", policy, op); + } else if (scalar_y == -1) { + Axpby_MV_Unroll_Functor 
op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S1", policy, op); + } else if (scalar_y == 1) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S2", policy, op); + } else if (scalar_y == 2) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S3", policy, op); + } } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - // a == 2 - if (a == 2 && b == 0) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S12", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == -1' + // **************************************************************** + else if (scalar_x == -1) { + if (scalar_y == 0) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S4", policy, op); + } else if (scalar_y == -1) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S5", policy, op); + } else if (scalar_y == 1) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S6", policy, op); + } else if (scalar_y == 2) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S7", policy, op); + } } - -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - if (a == 2 && b == -1) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S13", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == 1' + // **************************************************************** + else if (scalar_x == 1) { + if (scalar_y == 0) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, 
startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S8", policy, op); + } else if (scalar_y == -1) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S9", policy, op); + } else if (scalar_y == 1) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S10", policy, op); + } else if (scalar_y == 2) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S11", policy, op); + } } - if (a == 2 && b == 1) { - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S14", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == 2' + // **************************************************************** + else if (scalar_x == 2) { + if (scalar_y == 0) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S12", policy, op); + } else if (scalar_y == -1) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S13", policy, op); + } else if (scalar_y == 1) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S14", policy, op); + } else if (scalar_y == 2) { + Axpby_MV_Unroll_Functor op( + x, y, av, bv, startingColumn); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S15", policy, op); + } } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - // a and b arbitrary (not -1, 0, or 1) - Axpby_MV_Unroll_Functor op( - x, y, av, bv, startingColumn); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S15", policy, op); } // Invoke the "generic" (not unrolled) multivector functor that @@ -1092,11 +1361,11 @@ void Axpby_MV_Unrolled(const execution_space& space, const AV& av, const XMV& x, // 3. 
Y(i,j) = a*X(i,j) + b*Y(i,j) for a in -1,0,1 // 4. Y(i,j) = av(j)*X(i,j) + bv(j)*Y(i,j) // -// a and b come in as integers. The values -1, 0, and 1 correspond to -// the literal values of the coefficients. The value 2 tells the -// functor to use the corresponding vector of coefficients: a == 2 -// means use av, and b == 2 means use bv. Otherwise, av resp. vb are -// ignored. +// scalar_x and scalar_y come in as integers. The values -1, 0, and 1 +// correspond to the literal values of the coefficients. The value 2 +// tells the functor to use the corresponding vector of coefficients: +// - scalar_x == 2 means use av, otherwise ignore av; +// - scalar_y == 2 means use bv, otherwise ignore bv. // // Any literal coefficient of zero has BLAS semantics of ignoring the // corresponding (multi)vector entry. This does NOT apply to @@ -1106,121 +1375,109 @@ void Axpby_MV_Unrolled(const execution_space& space, const AV& av, const XMV& x, template void Axpby_MV_Generic(const execution_space& space, const AV& av, const XMV& x, - const BV& bv, const YMV& y, int a = 2, int b = 2) { + const BV& bv, const YMV& y, int scalar_x = 2, + int scalar_y = 2) { static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Generic: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Generic()" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Generic: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Generic()" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_MV_Generic: Y is const. 
" - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_MV_Generic()" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YMV::rank == (int)XMV::rank, - "KokkosBlas::Impl::" - "Axpby_MV_Generic: X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_MV_Generic()" + ": X and Y must have the same rank."); static_assert(YMV::rank == 2, - "KokkosBlas::Impl::Axpby_MV_Generic: " - "XMV and YMV must have rank 2."); + "KokkosBlas::Impl::Axpby_MV_Generic()" + ": XMV and YMV must have rank 2."); + if ((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2)) { + // Ok + } else { + KokkosKernels::Impl::throw_runtime_exception( + "KokkosBlas::Impl::Axpby_MV_Generic()" + ": scalar_x and/or scalar_y are out of range."); + } const SizeType numRows = x.extent(0); Kokkos::RangePolicy policy(space, 0, numRows); - if (a == 0 && b == 0) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S16", policy, op); - return; - } - -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - if (a == 0 && b == -1) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S17", policy, op); - return; - } - if (a == 0 && b == 1) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S18", policy, op); - return; - } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - if (a == 0 && b == 2) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S19", policy, op); - return; - } - -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - // a == -1 - if (a == -1 && b == 0) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S20", policy, op); - return; - } - if (a == -1 && b == -1) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S21", policy, 
op); - return; - } - if (a == -1 && b == 1) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S22", policy, op); - return; - } - if (a == -1 && b == 2) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S23", policy, op); - return; - } - // a == 1 - if (a == 1 && b == 0) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S24", policy, op); - return; - } - if (a == 1 && b == -1) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S25", policy, op); - return; - } - if (a == 1 && b == 1) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S26", policy, op); - return; - } - if (a == 1 && b == 2) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S27", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == 0' + // **************************************************************** + if (scalar_x == 0) { + if (scalar_y == 0) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S16", policy, op); + } else if (scalar_y == -1) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S17", policy, op); + } else if (scalar_y == 1) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S18", policy, op); + } else if (scalar_y == 2) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S19", policy, op); + } } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - // a == 2 - if (a == 2 && b == 0) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S28", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == -1' + // 
**************************************************************** + else if (scalar_x == -1) { + if (scalar_y == 0) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S20", policy, op); + } else if (scalar_y == -1) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S21", policy, op); + } else if (scalar_y == 1) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S22", policy, op); + } else if (scalar_y == 2) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S23", policy, op); + } } - -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - if (a == 2 && b == -1) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S29", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == 1' + // **************************************************************** + else if (scalar_x == 1) { + if (scalar_y == 0) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S24", policy, op); + } else if (scalar_y == -1) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S25", policy, op); + } else if (scalar_y == 1) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S26", policy, op); + } else if (scalar_y == 2) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S27", policy, op); + } } - if (a == 2 && b == 1) { - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S30", policy, op); - return; + // **************************************************************** + // Possibilities with 'scalar_x == 2' + // **************************************************************** + else if (scalar_x == 2) { + if (scalar_y == 0) { + Axpby_MV_Functor op(x, y, av, bv); + 
Kokkos::parallel_for("KokkosBlas::Axpby::MV::S28", policy, op); + } else if (scalar_y == -1) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S29", policy, op); + } else if (scalar_y == 1) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S30", policy, op); + } else if (scalar_y == 2) { + Axpby_MV_Functor op(x, y, av, bv); + Kokkos::parallel_for("KokkosBlas::Axpby::MV::S31", policy, op); + } } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - - // a and b arbitrary (not -1, 0, or 1) - Axpby_MV_Functor op(x, y, av, bv); - Kokkos::parallel_for("KokkosBlas::Axpby::MV::S31", policy, op); } // Compute any of the following, in a way optimized for X and Y @@ -1231,11 +1488,11 @@ void Axpby_MV_Generic(const execution_space& space, const AV& av, const XMV& x, // 3. Y(i,j) = a*X(i,j) + b*Y(i,j) for a in -1,0,1 // 4. Y(i,j) = av(j)*X(i,j) + bv(j)*Y(i,j) // -// a and b come in as integers. The values -1, 0, and 1 correspond to -// the literal values of the coefficients. The value 2 tells the -// functor to use the corresponding vector of coefficients: a == 2 -// means use av, and b == 2 means use bv. Otherwise, av resp. vb are -// ignored. +// scalar_x and scalar_y come in as integers. The values -1, 0, and 1 +// correspond to the literal values of the coefficients. The value 2 +// tells the functor to use the corresponding vector of coefficients: +// - scalar_x == 2 means use av, otherwise ignore av; +// - scalar_y == 2 means use bv, otherwise ignore bv. // // Any literal coefficient of zero has BLAS semantics of ignoring the // corresponding (multi)vector entry. 
This does NOT apply to @@ -1246,24 +1503,33 @@ template struct Axpby_MV_Invoke_Left { static void run(const execution_space& space, const AV& av, const XMV& x, - const BV& bv, const YMV& y, int a = 2, int b = 2) { + const BV& bv, const YMV& y, int scalar_x = 2, + int scalar_y = 2) { static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Invoke_Left: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Left::run()" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Invoke_Left: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Left::run()" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_MV_Invoke_Left: Y is const. " - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Left::run()" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YMV::rank == (int)XMV::rank, - "KokkosBlas::Impl::" - "Axpby_MV_Invoke_Left: X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Left::run()" + ": X and Y must have the same rank."); static_assert(YMV::rank == 2, - "KokkosBlas::Impl::Axpby_MV_Invoke_Left: " - "X and Y must have rank 2."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Left::run()" + ": X and Y must have rank 2."); + if ((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2)) { + // Ok + } else { + KokkosKernels::Impl::throw_runtime_exception( + "KokkosBlas::Impl::Axpby_MV_Invoke_Left::run()" + ": scalar_x and/or scalar_y are out of range."); + } const SizeType numCols = x.extent(1); @@ -1279,7 +1545,7 @@ struct Axpby_MV_Invoke_Left { // subviews of av and bv, if they are Views. If they are scalars, // the functor doesn't have to do anything to them. 
Axpby_MV_Unrolled( - space, av, X_cur, bv, Y_cur, j, a, b); + space, av, X_cur, bv, Y_cur, j, scalar_x, scalar_y); } for (; j + 4 <= numCols; j += 4) { XMV X_cur = Kokkos::subview(x, Kokkos::ALL(), std::make_pair(j, j + 4)); @@ -1289,7 +1555,7 @@ struct Axpby_MV_Invoke_Left { // subviews of av and bv, if they are Views. If they are scalars, // the functor doesn't have to do anything to them. Axpby_MV_Unrolled( - space, av, X_cur, bv, Y_cur, j, a, b); + space, av, X_cur, bv, Y_cur, j, scalar_x, scalar_y); } for (; j < numCols; ++j) { auto x_cur = Kokkos::subview(x, Kokkos::ALL(), j); @@ -1301,24 +1567,24 @@ struct Axpby_MV_Invoke_Left { typedef decltype(x_cur) XV; typedef decltype(y_cur) YV; Axpby_Generic( - space, av, x_cur, bv, y_cur, j, a, b); + space, av, x_cur, bv, y_cur, j, scalar_x, scalar_y); } } }; -// Compute any of the following, in a way optimized for X, Y, and R +// Compute any of the following, in a way optimized for X and Y // being LayoutRight: // // 1. Y(i,j) = a*X(i,j) + b*Y(i,j) for a,b in -1,0,1 // 2. Y(i,j) = av(j)*X(i,j) + b*Y(i,j) for b in -1,0,1 -// 3. Y(i,j) = a*X(i,j) + b*Y(i,j) for a in -1,0,1 +// 3. Y(i,j) = a*X(i,j) + bv(j)*Y(i,j) for a in -1,0,1 // 4. Y(i,j) = av(j)*X(i,j) + bv(j)*Y(i,j) // -// a and b come in as integers. The values -1, 0, and 1 correspond to -// the literal values of the coefficients. The value 2 tells the -// functor to use the corresponding vector of coefficients: a == 2 -// means use av, and b == 2 means use bv. Otherwise, av resp. vb are -// ignored. +// scalar_x and scalar_y come in as integers. The values -1, 0, and 1 +// correspond to the literal values of the coefficients. The value 2 +// tells the functor to use the corresponding vector of coefficients: +// - scalar_x == 2 means use av, otherwise ignore av; +// - scalar_y == 2 means use bv, otherwise ignore bv. // // Any literal coefficient of zero has BLAS semantics of ignoring the // corresponding (multi)vector entry. 
This does NOT apply to @@ -1329,24 +1595,33 @@ template struct Axpby_MV_Invoke_Right { static void run(const execution_space& space, const AV& av, const XMV& x, - const BV& bv, const YMV& y, int a = 2, int b = 2) { + const BV& bv, const YMV& y, int scalar_x = 2, + int scalar_y = 2) { static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Invoke_Right: X is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Right::run()" + ": X is not a Kokkos::View."); static_assert(Kokkos::is_view::value, - "KokkosBlas::Impl::" - "Axpby_MV_Invoke_Right: Y is not a Kokkos::View."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Right::run()" + ": Y is not a Kokkos::View."); static_assert(std::is_same::value, - "KokkosBlas::Impl::Axpby_MV_Invoke_Right: Y is const. " - "It must be nonconst, because it is an output argument " - "(we have to be able to write to its entries)."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Right::run()" + ": Y must be nonconst, since it is an output argument" + " and we have to be able to write to its entries."); static_assert((int)YMV::rank == (int)XMV::rank, - "KokkosBlas::Impl::" - "Axpby_MV_Invoke_Right: X and Y must have the same rank."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Right::run()" + ": X and Y must have the same rank."); static_assert(YMV::rank == 2, - "KokkosBlas::Impl::Axpby_MV_Invoke_Right: " - "X and Y must have rank 2."); + "KokkosBlas::Impl::Axpby_MV_Invoke_Right::run()" + ": X and Y must have rank 2."); + if ((-1 <= scalar_x) && (scalar_x <= 2) && (-1 <= scalar_y) && + (scalar_y <= 2)) { + // Ok + } else { + KokkosKernels::Impl::throw_runtime_exception( + "KokkosBlas::Impl::Axpby_MV_Invoke_Right::run()" + ": scalar_x and/or scalar_y are out of range."); + } const SizeType numCols = x.extent(1); if (numCols == 1) { @@ -1355,10 +1630,10 @@ struct Axpby_MV_Invoke_Right { typedef decltype(x_0) XV; typedef decltype(y_0) YV; Axpby_Generic( - space, av, x_0, bv, y_0, 0, a, b); + space, av, x_0, bv, y_0, 0, scalar_x, scalar_y); } 
else { Axpby_MV_Generic( - space, av, x, bv, y, a, b); + space, av, x, bv, y, scalar_x, scalar_y); } } }; diff --git a/blas/impl/KokkosBlas1_axpby_spec.hpp b/blas/impl/KokkosBlas1_axpby_spec.hpp index da2924c9f3..3aff21e0be 100644 --- a/blas/impl/KokkosBlas1_axpby_spec.hpp +++ b/blas/impl/KokkosBlas1_axpby_spec.hpp @@ -56,6 +56,23 @@ struct axpby_eti_spec_avail { Kokkos::MemoryTraits >, \ 1> { \ enum : bool { value = true }; \ + }; \ + template <> \ + struct axpby_eti_spec_avail< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + 1> { \ + enum : bool { value = true }; \ }; // @@ -82,13 +99,13 @@ struct axpby_eti_spec_avail { template <> \ struct axpby_eti_spec_avail< \ EXEC_SPACE, \ - Kokkos::View, \ Kokkos::MemoryTraits >, \ Kokkos::View, \ Kokkos::MemoryTraits >, \ - Kokkos::View, \ Kokkos::MemoryTraits >, \ Kokkos::View, \ @@ -150,11 +167,17 @@ struct Axpby { }; #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY -// Full specialization for XMV and YMV rank-2 Views. 
+// ********************************************************************** +// Full specialization for XMV and YMV rank-2 Views: +// --> AV = anything and BV = anything +// +// If axpby() runs at a device with rank-2 XMV and rank-2 YMV, then +// the unification process forces AV = view and BV = view +// ********************************************************************** template struct Axpby { - typedef typename YMV::size_type size_type; + using size_type = typename YMV::size_type; static void axpby(const execution_space& space, const AV& av, const XMV& X, const BV& bv, const YMV& Y) { @@ -193,49 +216,83 @@ struct Axpby) { + if constexpr (AV::rank == 1) { + if (av.extent(0) == 0) { + scalar_x = 0; + } + } + } else { + using ATA = Kokkos::ArithTraits; + if (av == ATA::zero()) { + scalar_x = 0; + } else if (av == -ATA::one()) { + scalar_x = -1; + } else if (av == ATA::one()) { + scalar_x = 1; + } } - if (bv.extent(0) == 0) { - b = 0; + + int scalar_y(2); + if constexpr (Kokkos::is_view_v) { + if constexpr (BV::rank == 1) { + if (bv.extent(0) == 0) { + scalar_y = 0; + } + } + } else { + using ATB = Kokkos::ArithTraits; + if (bv == ATB::zero()) { + scalar_y = 0; + } else if (bv == -ATB::one()) { + scalar_y = -1; + } else if (bv == ATB::one()) { + scalar_y = 1; + } } if (numRows < static_cast(INT_MAX) && numRows * numCols < static_cast(INT_MAX)) { - typedef int index_type; - typedef typename std::conditional< + using index_type = int; + using Axpby_MV_Invoke_Layout = typename std::conditional< std::is_same::value, - Axpby_MV_Invoke_Right, - Axpby_MV_Invoke_Left >::type Axpby_MV_Invoke_Layout; - Axpby_MV_Invoke_Layout::run(space, av, X, bv, Y, a, b); + Axpby_MV_Invoke_Left, + Axpby_MV_Invoke_Right >::type; + Axpby_MV_Invoke_Layout::run(space, av, X, bv, Y, scalar_x, scalar_y); } else { - typedef typename XMV::size_type index_type; - typedef typename std::conditional< + using index_type = typename XMV::size_type; + using Axpby_MV_Invoke_Layout = typename 
std::conditional< std::is_same::value, - Axpby_MV_Invoke_Right, - Axpby_MV_Invoke_Left >::type Axpby_MV_Invoke_Layout; - Axpby_MV_Invoke_Layout::run(space, av, X, bv, Y, a, b); + Axpby_MV_Invoke_Left, + Axpby_MV_Invoke_Right >::type; + Axpby_MV_Invoke_Layout::run(space, av, X, bv, Y, scalar_x, scalar_y); } Kokkos::Profiling::popRegion(); } }; -// Partial specialization for XMV, and YMV rank-2 Views, -// and AV and BV scalars. +// ********************************************************************** +// Partial specialization for XMV and YMV rank-2 Views: +// --> AV = scalar and BV = scalar +// +// If axpby() runs at the host with rank-2 XMV and rank-2 YMV, then +// the unification process _might_ force AV = scalar and BV = scalar +// ********************************************************************** template struct Axpby { - typedef typename XMV::non_const_value_type AV; - typedef typename YMV::non_const_value_type BV; - typedef typename YMV::size_type size_type; - typedef Kokkos::ArithTraits ATA; - typedef Kokkos::ArithTraits ATB; + using AV = typename XMV::non_const_value_type; + using BV = typename YMV::non_const_value_type; + using size_type = typename YMV::size_type; + using ATA = Kokkos::ArithTraits; + using ATB = Kokkos::ArithTraits; static void axpby(const execution_space& space, const AV& alpha, const XMV& X, const BV& beta, const YMV& Y) { @@ -275,69 +332,135 @@ struct Axpby 2 - else if (alpha == -ATA::one()) { - a = -1; + scalar_x = 0; + } else if (alpha == -ATA::one()) { + scalar_x = -1; } else if (alpha == ATA::one()) { - a = 1; - } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - else { - a = 2; + scalar_x = 1; } + + int scalar_y(2); if (beta == ATB::zero()) { - b = 0; - } -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - else if (beta == -ATB::one()) { - b = -1; + scalar_y = 0; + } else if (beta == -ATB::one()) { + scalar_y = -1; } else if (beta == ATB::one()) { - b = 1; - } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - else { - b = 2; + 
scalar_y = 1; } if (numRows < static_cast(INT_MAX) && numRows * numCols < static_cast(INT_MAX)) { - typedef int index_type; - typedef typename std::conditional< + using index_type = int; + using Axpby_MV_Invoke_Layout = typename std::conditional< std::is_same::value, - Axpby_MV_Invoke_Right, - Axpby_MV_Invoke_Left >::type Axpby_MV_Invoke_Layout; - Axpby_MV_Invoke_Layout::run(space, alpha, X, beta, Y, a, b); + Axpby_MV_Invoke_Left, + Axpby_MV_Invoke_Right >::type; + Axpby_MV_Invoke_Layout::run(space, alpha, X, beta, Y, scalar_x, scalar_y); } else { - typedef typename XMV::size_type index_type; - typedef typename std::conditional< + using index_type = typename XMV::size_type; + using Axpby_MV_Invoke_Layout = typename std::conditional< std::is_same::value, - Axpby_MV_Invoke_Right, - Axpby_MV_Invoke_Left >::type Axpby_MV_Invoke_Layout; - Axpby_MV_Invoke_Layout::run(space, alpha, X, beta, Y, a, b); + Axpby_MV_Invoke_Left, + Axpby_MV_Invoke_Right >::type; + Axpby_MV_Invoke_Layout::run(space, alpha, X, beta, Y, scalar_x, scalar_y); } Kokkos::Profiling::popRegion(); } }; -// Partial specialization for XV and YV rank-1 Views, -// and AV and BV scalars. +// ********************************************************************** +// Full specialization for XV and YV rank-1 Views: +// --> AV = anything and BV = anything +// +// If axpby() runs at a device with rank-1 XV and rank-1 YV, then +// the unification process forces AV = view and BV = view +// ********************************************************************** +template +struct Axpby { + using size_type = typename YV::size_type; + + static void axpby(const execution_space& space, const AV& av, const XV& X, + const BV& bv, const YV& Y) { + Kokkos::Profiling::pushRegion(KOKKOSKERNELS_IMPL_COMPILE_LIBRARY + ? 
"KokkosBlas::axpby[ETI]" + : "KokkosBlas::axpby[noETI]"); + + size_type const numRows = X.extent(0); + + int scalar_x(2); + if constexpr (Kokkos::is_view_v) { + if constexpr (AV::rank == 1) { + if (av.extent(0) == 0) { + scalar_x = 0; + } + } + } else { + using ATA = Kokkos::ArithTraits; + if (av == ATA::zero()) { + scalar_x = 0; + } else if (av == -ATA::one()) { + scalar_x = -1; + } else if (av == ATA::one()) { + scalar_x = 1; + } + } + + int scalar_y(2); + if constexpr (Kokkos::is_view_v) { + if constexpr (BV::rank == 1) { + if (bv.extent(0) == 0) { + scalar_y = 0; + } + } + } else { + using ATB = Kokkos::ArithTraits; + if (bv == ATB::zero()) { + scalar_y = 0; + } else if (bv == -ATB::one()) { + scalar_y = -1; + } else if (bv == ATB::one()) { + scalar_y = 1; + } + } + + if (numRows < static_cast(INT_MAX)) { + using index_type = int; + Axpby_Generic( + space, av, X, bv, Y, 0, scalar_x, scalar_y); + } else { + using index_type = typename XV::size_type; + Axpby_Generic( + space, av, X, bv, Y, 0, scalar_x, scalar_y); + } + + Kokkos::Profiling::popRegion(); + } +}; + +// ********************************************************************** +// Partial specialization for XV and YV rank-1 Views: +// --> AV = scalar and BV = scalar +// +// If axpby() runs at the host with rank-1 XV and rank-1 YV, then +// the unification process forces AV = scalar and BV = scalar +// ********************************************************************** template struct Axpby { - typedef typename XV::non_const_value_type AV; - typedef typename YV::non_const_value_type BV; - typedef typename YV::size_type size_type; - typedef Kokkos::ArithTraits ATA; - typedef Kokkos::ArithTraits ATB; + using AV = typename XV::non_const_value_type; + using BV = typename YV::non_const_value_type; + using size_type = typename YV::size_type; + using ATA = Kokkos::ArithTraits; + using ATB = Kokkos::ArithTraits; static void axpby(const execution_space& space, const AV& alpha, const XV& X, const BV& beta, const 
YV& Y) { @@ -377,41 +500,36 @@ struct Axpby 2 - else if (alpha == -ATA::one()) { - a = -1; + scalar_x = 0; + } else if (alpha == -ATA::one()) { + scalar_x = -1; } else if (alpha == ATA::one()) { - a = 1; + scalar_x = 1; } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - int b = 2; + int scalar_y(2); if (beta == ATB::zero()) { - b = 0; - } -#if KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 - else if (beta == -ATB::one()) { - b = -1; + scalar_y = 0; + } else if (beta == -ATB::one()) { + scalar_y = -1; } else if (beta == ATB::one()) { - b = 1; + scalar_y = 1; } -#endif // KOKKOSBLAS_OPTIMIZATION_LEVEL_AXPBY > 2 if (numRows < static_cast(INT_MAX)) { - typedef int index_type; + using index_type = int; Axpby_Generic( - space, alpha, X, beta, Y, 0, a, b); + space, alpha, X, beta, Y, 0, scalar_x, scalar_y); } else { - typedef typename XV::size_type index_type; + using index_type = typename XV::size_type; Axpby_Generic( - space, alpha, X, beta, Y, 0, a, b); + space, alpha, X, beta, Y, 0, scalar_x, scalar_y); } Kokkos::Profiling::popRegion(); } @@ -437,6 +555,20 @@ struct Axpby, \ Kokkos::MemoryTraits >, \ SCALAR, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + 1, false, true>; \ + extern template struct Axpby< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ Kokkos::View, \ Kokkos::MemoryTraits >, \ 1, false, true>; @@ -448,6 +580,20 @@ struct Axpby, \ Kokkos::MemoryTraits >, \ SCALAR, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + 1, false, true>; \ + template struct Axpby< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ Kokkos::View, \ Kokkos::MemoryTraits >, \ 1, false, true>; diff --git a/blas/impl/KokkosBlas1_axpby_unification_attempt_traits.hpp b/blas/impl/KokkosBlas1_axpby_unification_attempt_traits.hpp new file mode 100644 index 0000000000..9d200e892d 
--- /dev/null +++ b/blas/impl/KokkosBlas1_axpby_unification_attempt_traits.hpp @@ -0,0 +1,965 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER +#ifndef KOKKOS_BLAS1_AXPBY_UNIFICATION_ATTEMPT_TRAITS_HPP_ +#define KOKKOS_BLAS1_AXPBY_UNIFICATION_ATTEMPT_TRAITS_HPP_ + +#include +#include +#include + +namespace KokkosBlas { +namespace Impl { + +// -------------------------------- + +template +constexpr int typeRank() { + if constexpr (Kokkos::is_view_v) { + return T::rank; + } + return -1; +} + +// -------------------------------- + +template +constexpr typename std::enable_if, bool>::type Tr0_val() { + return (T::rank == 0); +} + +template +constexpr typename std::enable_if, bool>::type Tr0_val() { + return false; +} + +// -------------------------------- + +template +constexpr typename std::enable_if, bool>::type Tr1s_val() { + return (T::rank == 1) && (T::rank_dynamic == 0); +} + +template +constexpr typename std::enable_if, bool>::type +Tr1s_val() { + return false; +} + +// -------------------------------- + +template +constexpr typename std::enable_if, bool>::type Tr1d_val() { + return (T::rank == 1) && (T::rank_dynamic == 1); +} + +template +constexpr typename std::enable_if, bool>::type +Tr1d_val() { + return false; +} + +// -------------------------------- + +template +struct getScalarTypeFromView { + using type = void; +}; + +template +struct getScalarTypeFromView { + using type = typename T::value_type; +}; + +// -------------------------------- + +template +struct 
getLayoutFromView { + using type = void; +}; + +template +struct getLayoutFromView { + using type = typename T::array_layout; +}; + +// -------------------------------- + +template +struct AxpbyUnificationAttemptTraits { + // ******************************************************************** + // Terminology: + // - variable names begin with lower case letters + // - type names begin with upper case letters + // ******************************************************************** + public: + static constexpr bool onDevice = + KokkosKernels::Impl::kk_is_gpu_exec_space(); + + private: + static constexpr bool onHost = !onDevice; + + public: + static constexpr bool a_is_scalar = !Kokkos::is_view_v; + + private: + static constexpr bool a_is_r0 = Tr0_val(); + static constexpr bool a_is_r1s = Tr1s_val(); + static constexpr bool a_is_r1d = Tr1d_val(); + + static constexpr bool x_is_r1 = Kokkos::is_view_v && (XMV::rank == 1); + static constexpr bool x_is_r2 = Kokkos::is_view_v && (XMV::rank == 2); + + public: + static constexpr bool b_is_scalar = !Kokkos::is_view_v; + + private: + static constexpr bool b_is_r0 = Tr0_val(); + static constexpr bool b_is_r1s = Tr1s_val(); + static constexpr bool b_is_r1d = Tr1d_val(); + + static constexpr bool y_is_r1 = Kokkos::is_view_v && (YMV::rank == 1); + static constexpr bool y_is_r2 = Kokkos::is_view_v && (YMV::rank == 2); + + static constexpr bool xyRank1Case = x_is_r1 && y_is_r1; + static constexpr bool xyRank2Case = x_is_r2 && y_is_r2; + + // ******************************************************************** + // Declare 'AtInputScalarTypeA_nonConst' + // ******************************************************************** + using ScalarTypeA2_onDevice = + typename getScalarTypeFromView::type; + using ScalarTypeA1_onDevice = + std::conditional_t; + + using ScalarTypeA2_onHost = + typename getScalarTypeFromView::type; + using ScalarTypeA1_onHost = + std::conditional_t; + + using AtInputScalarTypeA = + std::conditional_t; + + using 
AtInputScalarTypeA_nonConst = + typename std::remove_const::type; + + // ******************************************************************** + // Declare 'AtInputScalarTypeX_nonConst' + // ******************************************************************** + using AtInputScalarTypeX = typename XMV::value_type; + + using AtInputScalarTypeX_nonConst = typename XMV::non_const_value_type; + + // ******************************************************************** + // Declare 'AtInputScalarTypeB_nonConst' + // ******************************************************************** + using ScalarTypeB2_onDevice = + typename getScalarTypeFromView::type; + using ScalarTypeB1_onDevice = + std::conditional_t; + + using ScalarTypeB2_onHost = + typename getScalarTypeFromView::type; + using ScalarTypeB1_onHost = + std::conditional_t; + + using AtInputScalarTypeB = + std::conditional_t; + + using AtInputScalarTypeB_nonConst = + typename std::remove_const::type; + + // ******************************************************************** + // Declare 'AtInputScalarTypeY_nonConst' + // ******************************************************************** + using AtInputScalarTypeY = typename YMV::value_type; + + using AtInputScalarTypeY_nonConst = typename YMV::non_const_value_type; + + // ******************************************************************** + // Declare 'InternalLayoutX' and 'InternalLayoutY' + // ******************************************************************** + using InternalLayoutX = + typename KokkosKernels::Impl::GetUnifiedLayout::array_layout; + using InternalLayoutY = + typename KokkosKernels::Impl::GetUnifiedLayoutPreferring< + YMV, InternalLayoutX>::array_layout; + + // ******************************************************************** + // Declare 'InternalTypeA_tmp' + // ******************************************************************** + using AtInputLayoutA = + typename getLayoutFromView::type; + + public: + static constexpr bool 
atInputLayoutA_isStride = + std::is_same_v; + + private: + using InternalLayoutA = + std::conditional_t<(a_is_r1d || a_is_r1s) && atInputLayoutA_isStride, + AtInputLayoutA, InternalLayoutX>; + + static constexpr bool atInputScalarTypeA_mustRemain = + Kokkos::ArithTraits::is_complex && + !Kokkos::ArithTraits::is_complex; + + using InternalScalarTypeA = std::conditional_t< + atInputScalarTypeA_mustRemain || ((a_is_r1d || a_is_r1s) && xyRank2Case), + AtInputScalarTypeA_nonConst // Yes, keep the input scalar type + , + AtInputScalarTypeX_nonConst // Yes, instead of + // 'AtInputScalarTypeA_nonConst' + >; + + using InternalTypeA_onDevice = std::conditional_t< + a_is_scalar && b_is_scalar && onDevice, // Keep 'a' as scalar + InternalScalarTypeA, + Kokkos::View>>; + + using InternalTypeA_onHost = std::conditional_t< + (a_is_r1d || a_is_r1s) && xyRank2Case && onHost, + Kokkos::View>, + InternalScalarTypeA>; + + using InternalTypeA_tmp = + std::conditional_t; + + // ******************************************************************** + // Declare 'InternalTypeX' + // ******************************************************************** + public: + using InternalTypeX = std::conditional_t< + x_is_r2, + Kokkos::View>, + Kokkos::View>>; + + // ******************************************************************** + // Declare 'InternalTypeB_tmp' + // ******************************************************************** + private: + using AtInputLayoutB = + typename getLayoutFromView::type; + + public: + static constexpr bool atInputLayoutB_isStride = + std::is_same_v; + + private: + using InternalLayoutB = + std::conditional_t<(b_is_r1d || b_is_r1s) && atInputLayoutB_isStride, + AtInputLayoutB, InternalLayoutY>; + + static constexpr bool atInputScalarTypeB_mustRemain = + Kokkos::ArithTraits::is_complex && + !Kokkos::ArithTraits::is_complex; + + using InternalScalarTypeB = std::conditional_t< + atInputScalarTypeB_mustRemain || ((b_is_r1d || b_is_r1s) && xyRank2Case), + 
AtInputScalarTypeB_nonConst // Yes, keep the input scalar type + , + AtInputScalarTypeY_nonConst // Yes, instead of + // 'AtInputScalarTypeB_nonConst' + >; + + using InternalTypeB_onDevice = std::conditional_t< + a_is_scalar && b_is_scalar && onDevice, // Keep 'b' as scalar + InternalScalarTypeB, + Kokkos::View>>; + + using InternalTypeB_onHost = std::conditional_t< + (b_is_r1d || b_is_r1s) && xyRank2Case && onHost, + Kokkos::View>, + InternalScalarTypeB>; + + using InternalTypeB_tmp = + std::conditional_t; + + // ******************************************************************** + // Declare 'InternalTypeY' + // ******************************************************************** + public: + using InternalTypeY = std::conditional_t< + y_is_r2, + Kokkos::View>, + Kokkos::View>>; + + // ******************************************************************** + // Declare 'InternalTypeA': if 'InternalTypeB_tmp' is a view then + // make sure 'InternalTypeA' is a view as well + // ******************************************************************** + using InternalTypeA = std::conditional_t< + !Kokkos::is_view_v && + Kokkos::is_view_v, + Kokkos::View>, + InternalTypeA_tmp>; + + // ******************************************************************** + // Declare 'InternalTypeA_managed' with the same scalar type in + // 'InternalTypeA' + // ******************************************************************** + private: + using InternalLayoutA_managed = InternalLayoutA; + + public: + using InternalTypeA_managed = std::conditional_t< + Kokkos::is_view_v, + Kokkos::View, + void>; + + // ******************************************************************** + // Declare 'InternalTypeB' if 'InternalTypeA_tmp' is a view then + // make sure 'InternalTypeB' is a view as well + // ******************************************************************** + using InternalTypeB = std::conditional_t< + Kokkos::is_view_v && + !Kokkos::is_view_v, + Kokkos::View>, + InternalTypeB_tmp>; + + // 
******************************************************************** + // Declare 'InternalTypeB_managed' with the same scalar type in + // 'InternalTypeB' + // ******************************************************************** + private: + using InternalLayoutB_managed = InternalLayoutB; + + public: + using InternalTypeB_managed = std::conditional_t< + Kokkos::is_view_v, + Kokkos::View, + void>; + + // ******************************************************************** + // Auxiliary Boolean results on internal types + // ******************************************************************** + private: + static constexpr bool internalTypeA_is_scalar = + !Kokkos::is_view_v; + static constexpr bool internalTypeA_is_r1d = Tr1d_val(); + + static constexpr bool internalTypeB_is_scalar = + !Kokkos::is_view_v; + static constexpr bool internalTypeB_is_r1d = Tr1d_val(); + + public: + static constexpr bool internalTypesAB_bothScalars = + (internalTypeA_is_scalar && internalTypeB_is_scalar); + static constexpr bool internalTypesAB_bothViews = + (internalTypeA_is_r1d && internalTypeB_is_r1d); + + // ******************************************************************** + // Routine to perform checks (both compile time and run time) + // ******************************************************************** + static void performChecks(const AV& a, const XMV& X, const BV& b, + const YMV& Y) { + // ****************************************************************** + // Check 1/6: General checks + // ****************************************************************** + static_assert( + Kokkos::is_execution_space_v, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": tExecSpace must be a valid Kokkos execution space."); + + static_assert( + (xyRank1Case && !xyRank2Case) || (!xyRank1Case && xyRank2Case), + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": one must have either both X and Y as rank 1, or both X and Y as " + "rank 2"); + + if 
constexpr (!Kokkos::ArithTraits< + AtInputScalarTypeY_nonConst>::is_complex) { + static_assert( + (!Kokkos::ArithTraits::is_complex) && + (!Kokkos::ArithTraits::is_complex) && + (!Kokkos::ArithTraits::is_complex), + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": if Y is not complex, then A, X and B cannot be complex"); + } + + // ****************************************************************** + // Check 2/6: YMV is valid + // ****************************************************************** + static_assert( + Kokkos::is_view::value, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": Y is not a Kokkos::View."); + static_assert( + std::is_same::value, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": Y is const. It must be nonconst, " + "because it is an output argument " + "(we must be able to write to its entries)."); + static_assert( + Kokkos::SpaceAccessibility::accessible, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": YMV must be accessible from tExecSpace"); + + // ****************************************************************** + // Check 3/6: XMV is valid + // ****************************************************************** + static_assert( + Kokkos::is_view::value, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": X is not a Kokkos::View."); + static_assert( + Kokkos::SpaceAccessibility::accessible, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": XMV must be accessible from tExecSpace"); + + if constexpr (xyRank1Case) { + if (X.extent(0) != Y.extent(0)) { + std::ostringstream msg; + msg << "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks(" + ")" + << ", invalid rank-1 X extent" + << ": X.extent(0) = " << X.extent(0) + << ", Y.extent(0) = " << Y.extent(0); + KokkosKernels::Impl::throw_runtime_exception(msg.str()); + } + } else { + if ((X.extent(0) != Y.extent(0)) || (X.extent(1) !=
Y.extent(1))) { + std::ostringstream msg; + msg << "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks(" + ")" + << ", invalid rank-2 X extents" + << ": X.extent(0) = " << X.extent(0) + << ", X.extent(1) = " << X.extent(1) + << ", Y.extent(0) = " << Y.extent(0) + << ", Y.extent(1) = " << Y.extent(1); + KokkosKernels::Impl::throw_runtime_exception(msg.str()); + } + } + + // ****************************************************************** + // Check 4/6: AV is valid + // ****************************************************************** + static_assert( + (a_is_scalar && !a_is_r0 && !a_is_r1s && !a_is_r1d) || + (!a_is_scalar && a_is_r0 && !a_is_r1s && !a_is_r1d) || + (!a_is_scalar && !a_is_r0 && a_is_r1s && !a_is_r1d) || + (!a_is_scalar && !a_is_r0 && !a_is_r1s && a_is_r1d), + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": 'a' must be either scalar or rank 0 or rank 1 static or rank 1 " + "dynamic"); + + if constexpr (a_is_r1d || a_is_r1s) { + if constexpr (xyRank1Case) { + if (a.extent(0) != 1) { + std::ostringstream msg; + msg << "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::" + "performChecks()" + << ": view 'a' must have extent(0) == 1 for xyRank1Case" + << ", a.extent(0) = " << a.extent(0); + KokkosKernels::Impl::throw_runtime_exception(msg.str()); + } + } else { + if ((a.extent(0) == 1) || + (a.extent(0) == Y.extent(1))) { // Yes, 'Y' is the reference + // Ok + } else { + std::ostringstream msg; + msg << "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::" + "performChecks()" + << ": view 'a' must have extent(0) == 1 or Y.extent(1) for " + "xyRank2Case" + << ", a.extent(0) = " << a.extent(0) + << ", Y.extent(0) = " << Y.extent(0) + << ", Y.extent(1) = " << Y.extent(1); + KokkosKernels::Impl::throw_runtime_exception(msg.str()); + } + } // if (rank1Case) else + } // if a_is_r1d + + // ****************************************************************** + // Check 5/6: BV is valid + // 
****************************************************************** + static_assert( + (b_is_scalar && !b_is_r0 && !b_is_r1s && !b_is_r1d) || + (!b_is_scalar && b_is_r0 && !b_is_r1s && !b_is_r1d) || + (!b_is_scalar && !b_is_r0 && b_is_r1s && !b_is_r1d) || + (!b_is_scalar && !b_is_r0 && !b_is_r1s && b_is_r1d), + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ": 'b' must be either scalar or rank 0 or rank 1 static or rank 1 " + "dynamic"); + + if constexpr (b_is_r1d || b_is_r1s) { + if constexpr (xyRank1Case) { + if (b.extent(0) != 1) { + std::ostringstream msg; + msg << "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::" + "performChecks()" + << ": view 'b' must have extent(0) == 1 for xyRank1Case" + << ", b.extent(0) = " << b.extent(0); + KokkosKernels::Impl::throw_runtime_exception(msg.str()); + } + } else { + if ((b.extent(0) == 1) || (b.extent(0) == Y.extent(1))) { + // Ok + } else { + std::ostringstream msg; + msg << "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::" + "performChecks()" + << ": view 'b' must have extent(0) == 1 or Y.extent(1) for " + "xyRank2Case" + << ", b.extent(0) = " << b.extent(0) + << ", Y.extent(0) = " << Y.extent(0) + << ", Y.extent(1) = " << Y.extent(1); + KokkosKernels::Impl::throw_runtime_exception(msg.str()); + } + } // if (rank1Case) else + } // if b_is_r1d + + // ****************************************************************** + // Check 6/6: Checks on InternalTypeA, X, B, Y + // ****************************************************************** + if constexpr (onHost) { + if constexpr (xyRank1Case) { + constexpr bool internalTypeA_isOk = + (internalTypeA_is_scalar || internalTypeA_is_r1d); + static_assert( + internalTypeA_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, xyRank1Case: InternalTypeA is wrong"); + + constexpr bool internalTypeX_isOk = std::is_same_v< + InternalTypeX, + Kokkos::View>>; + static_assert( + internalTypeX_isOk, + 
"KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, xyRank1Case: InternalTypeX is wrong"); + + constexpr bool internalTypeB_isOk = + (internalTypeB_is_scalar || internalTypeB_is_r1d); + static_assert( + internalTypeB_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, xyRank1Case: InternalTypeB is wrong"); + + constexpr bool internalTypeY_isOk = std::is_same_v< + InternalTypeY, + Kokkos::View>>; + static_assert( + internalTypeY_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, xyRank1Case: InternalTypeY is wrong"); + } else { + constexpr bool internalTypeA_isOk = + (internalTypeA_is_scalar || internalTypeA_is_r1d); + static_assert( + internalTypeA_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, xyRank2Case: InternalTypeA is wrong"); + + constexpr bool internalTypeX_isOk = std::is_same_v< + InternalTypeX, + Kokkos::View>>; + static_assert( + internalTypeX_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, xyRank2Case: InternalTypeX is wrong"); + + constexpr bool internalTypeB_isOk = + (internalTypeB_is_scalar || internalTypeB_is_r1d); + static_assert( + internalTypeB_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, xyRank2Case: InternalTypeB is wrong"); + + constexpr bool internalTypeY_isOk = std::is_same_v< + InternalTypeY, + Kokkos::View>>; + static_assert( + internalTypeY_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, xyRank2Case: InternalTypeY is wrong"); + } + } else { + if constexpr (xyRank1Case) { + constexpr bool internalTypeA_isOk = + internalTypeA_is_r1d || + (a_is_scalar && b_is_scalar && internalTypeA_is_scalar); + static_assert( + internalTypeA_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, xyRank1Case: InternalTypeA is wrong"); + + constexpr bool 
internalTypeX_isOk = std::is_same_v< + InternalTypeX, + Kokkos::View>>; + static_assert( + internalTypeX_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, xyRank1Case: InternalTypeX is wrong"); + + constexpr bool internalTypeB_isOk = + internalTypeB_is_r1d || + (a_is_scalar && b_is_scalar && internalTypeB_is_scalar); + static_assert( + internalTypeB_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, xyRank1Case: InternalTypeB is wrong"); + + constexpr bool internalTypeY_isOk = std::is_same_v< + InternalTypeY, + Kokkos::View>>; + static_assert( + internalTypeY_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, xyRank1Case: InternalTypeY is wrong"); + } else { + constexpr bool internalTypeA_isOk = + internalTypeA_is_r1d || + (a_is_scalar && b_is_scalar && internalTypeA_is_scalar); + static_assert( + internalTypeA_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, xyRank2Case: InternalTypeA is wrong"); + + constexpr bool internalTypeX_isOk = std::is_same_v< + InternalTypeX, + Kokkos::View>>; + static_assert( + internalTypeX_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, xyRank2Case: InternalTypeX is wrong"); + + constexpr bool internalTypeB_isOk = + internalTypeB_is_r1d || + (a_is_scalar && b_is_scalar && internalTypeB_is_scalar); + static_assert( + internalTypeB_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, xyRank2Case: InternalTypeB is wrong"); + + constexpr bool internalTypeY_isOk = std::is_same_v< + InternalTypeY, + Kokkos::View>>; + static_assert( + internalTypeY_isOk, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, xyRank2Case: InternalTypeY is wrong"); + } + } + + if constexpr (onHost) { + // **************************************************************** + // We are in the 'onHost'
case, with 2 possible subcases: + // + // 1) xyRank1Case, with the following possible situations: + // - [InternalTypeA, B] = [S_a, S_b], or + // - [InternalTypeA, B] = [view, view] + // + // or + // + // 2) xyRank2Case, with the following possible situations: + // - [InternalTypeA, B] = [S_a, S_b], or + // - [InternalTypeA, B] = [view, view] + // **************************************************************** + static_assert( + internalTypesAB_bothScalars || internalTypesAB_bothViews, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onHost, invalid combination of types"); + } // If onHost + else if constexpr (onDevice) { + // **************************************************************** + // We are in the 'onDevice' case, with 2 possible subcases: + // + // 1) xyRank1Case, with the following possible situations: + // - [InternalTypeA, B] = [S_a, S_b], or + // - [InternalTypeA, B] = [view, view] + // + // or + // + // 2) xyRank2Case, with the following possible situations: + // - [InternalTypeA, B] = [S_a, S_b], or + // - [InternalTypeA, B] = [view, view] + // **************************************************************** + static_assert( + internalTypesAB_bothViews || + (a_is_scalar && b_is_scalar && internalTypesAB_bothScalars), + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", onDevice, invalid combination of types"); + } + + if constexpr (xyRank2Case && (a_is_r1d || a_is_r1s) && + atInputLayoutA_isStride) { + static_assert( + std::is_same_v< + typename getLayoutFromView< + InternalTypeA, Kokkos::is_view_v>::type, + Kokkos::LayoutStride>, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", xyRank2Case: coeff 'a' is rank-1 and has LayoutStride at input" + ", but no LayoutStride internally"); + } + + if constexpr (xyRank2Case && (b_is_r1d || b_is_r1s) && + atInputLayoutB_isStride) { + static_assert( + std::is_same_v< + typename getLayoutFromView< + InternalTypeB,
Kokkos::is_view_v>::type, + Kokkos::LayoutStride>, + "KokkosBlas::Impl::AxpbyUnificationAttemptTraits::performChecks()" + ", xyRank2Case: coeff 'b' is rank-1 and has LayoutStride at input" + ", but no LayoutStride internally"); + } + } // performChecks() + + // ******************************************************************** + // Routine to print information on input variables and internal variables + // ******************************************************************** +#if (KOKKOSKERNELS_DEBUG_LEVEL > 0) + static void printInformation(std::ostream& os, std::string const& headerMsg) { + os << headerMsg << ": AV = " + << typeid(AV).name() + //<< ", AV::const_data_type = " << typeid(AV::const_data_type).name() + //<< ", AV::non_const_data_type = " << + // typeid(AV::non_const_data_type).name() + << ", AtInputScalarTypeA = " << typeid(AtInputScalarTypeA).name() + << ", isConst = " + << std::is_const_v << ", isComplex = " + << Kokkos::ArithTraits::is_complex + << ", AtInputScalarTypeA_nonConst = " + << typeid(AtInputScalarTypeA_nonConst).name() + << ", InternalTypeA = " << typeid(InternalTypeA).name() << "\n" + << ", InternalTypeA_managed = " << typeid(InternalTypeA_managed).name() + << "\n" + << "\n" + << "XMV = " << typeid(XMV).name() << "\n" + << "XMV::value_type = " << typeid(typename XMV::value_type).name() + << "\n" + << "XMV::const_data_type = " + << typeid(typename XMV::const_data_type).name() << "\n" + << "XMV::non_const_data_type = " + << typeid(typename XMV::non_const_data_type).name() << "\n" + << "AtInputScalarTypeX = " << typeid(AtInputScalarTypeX).name() << "\n" + << "isConst = " << std::is_const_v << "\n" + << "isComplex = " + << Kokkos::ArithTraits::is_complex << "\n" + << "AtInputScalarTypeX_nonConst = " + << typeid(AtInputScalarTypeX_nonConst).name() << "\n" + << "InternalTypeX = " << typeid(InternalTypeX).name() << "\n" + << "\n" + << "BV = " + << typeid(BV).name() + //<< ", BV::const_data_type = " << typeid(BV::const_data_type).name() + //<< ", 
BV::non_const_data_type = " << + // typeid(BV::non_const_data_type).name() + << ", AtInputScalarTypeB = " << typeid(AtInputScalarTypeB).name() + << ", isConst = " + << std::is_const_v << ", isComplex = " + << Kokkos::ArithTraits::is_complex + << ", AtInputScalarTypeB_nonConst = " + << typeid(AtInputScalarTypeB_nonConst).name() + << ", InternalTypeB = " << typeid(InternalTypeB).name() << "\n" + << ", InternalTypeB_managed = " << typeid(InternalTypeB_managed).name() + << "\n" + << "\n" + << "YMV = " << typeid(YMV).name() << "\n" + << "YMV::value_type = " << typeid(typename YMV::value_type).name() + << "\n" + << "YMV::const_data_type = " + << typeid(typename YMV::const_data_type).name() << "\n" + << "YMV::non_const_data_type = " + << typeid(typename YMV::non_const_data_type).name() << "\n" + << "AtInputScalarTypeY = " << typeid(AtInputScalarTypeY).name() << "\n" + << "isConst = " << std::is_const_v << "\n" + << "isComplex = " + << Kokkos::ArithTraits::is_complex << "\n" + << "AtInputScalarTypeY_nonConst = " + << typeid(AtInputScalarTypeY_nonConst).name() << "\n" + << "InternalTypeY = " << typeid(InternalTypeY).name() << "\n" + << std::endl; + } +#endif + +}; // struct AxpbyUnificationAttemptTraits + +// -------------------------------- + +template +struct getScalarValueFromVariableAtHost { + getScalarValueFromVariableAtHost() { + static_assert((rankT == -1) || (rankT == 0) || (rankT == 1), + "Generic struct should not have been invoked!"); + } +}; + +template +struct getScalarValueFromVariableAtHost { + static T getValue(T const& var) { return var; } +}; + +template +struct getScalarValueFromVariableAtHost { + static typename T::value_type getValue(T const& var) { return var(); } +}; + +template +struct getScalarValueFromVariableAtHost { + static typename T::value_type getValue(T const& var) { return var[0]; } +}; + +// -------------------------------- + +template +size_t getAmountOfScalarsInCoefficient(T const& coeff) { + size_t result = 1; + if constexpr 
(Kokkos::is_view_v) { + if constexpr (T::rank == 1) { + result = coeff.extent(0); + } + } + return result; +} + +// -------------------------------- + +template +size_t getStrideInCoefficient(T const& coeff) { + size_t result = 1; + if constexpr (Kokkos::is_view_v) { + if constexpr ((T::rank == 1) && (std::is_same_v)) { + result = coeff.stride_0(); + } + } + return result; +} + +// -------------------------------- + +template +static void populateRank1Stride1ViewWithScalarOrNonStrideView( + T_in const& coeff_in, T_out& coeff_out) { + // *********************************************************************** + // 'coeff_out' is assumed to be rank-1, of LayoutLeft or LayoutRight + // + // One has to be careful with situations like the following: + // - a coeff_in that deals with 'double', and + // - a coeff_out deals with 'complex' + // *********************************************************************** + using ScalarOutType = + typename std::remove_const::type; + + if constexpr (!Kokkos::is_view_v) { + // ********************************************************************* + // 'coeff_in' is scalar + // ********************************************************************* + ScalarOutType scalarValue(coeff_in); + Kokkos::deep_copy(coeff_out, scalarValue); + } else if constexpr (T_in::rank == 0) { + // ********************************************************************* + // 'coeff_in' is rank-0 + // ********************************************************************* + typename T_in::HostMirror h_coeff_in("h_coeff_in"); + Kokkos::deep_copy(h_coeff_in, coeff_in); + ScalarOutType scalarValue(h_coeff_in()); + Kokkos::deep_copy(coeff_out, scalarValue); + } else { + // ********************************************************************* + // 'coeff_in' is also rank-1 + // ********************************************************************* + if (coeff_out.extent(0) != coeff_in.extent(0)) { + std::ostringstream msg; + msg << "In 
populateRank1Stride1ViewWithScalarOrNonStrideView()" + << ": 'in' and 'out' should have the same extent(0)" + << ", T_in = " << typeid(T_in).name() + << ", coeff_in.label() = " << coeff_in.label() + << ", coeff_in.extent(0) = " << coeff_in.extent(0) + << ", T_out = " << typeid(T_out).name() + << ", coeff_out.label() = " << coeff_out.label() + << ", coeff_out.extent(0) = " << coeff_out.extent(0); + KokkosKernels::Impl::throw_runtime_exception(msg.str()); + } + + using ScalarInType = + typename std::remove_const::type; + if constexpr (std::is_same_v) { + coeff_out = coeff_in; + } else if (coeff_out.extent(0) == 1) { + typename T_in::HostMirror h_coeff_in("h_coeff_in"); + Kokkos::deep_copy(h_coeff_in, coeff_in); + ScalarOutType scalarValue(h_coeff_in[0]); + Kokkos::deep_copy(coeff_out, scalarValue); + } else { + std::ostringstream msg; + msg << "In populateRank1Stride1ViewWithScalarOrNonStrideView()" + << ": scalar types 'in' and 'out' should be the same" + << ", T_in = " << typeid(T_in).name() + << ", ScalarInType = " << typeid(ScalarInType).name() + << ", coeff_in.label() = " << coeff_in.label() + << ", coeff_in.extent(0) = " << coeff_in.extent(0) + << ", T_out = " << typeid(T_out).name() + << ", ScalarOutType = " << typeid(ScalarOutType).name() + << ", coeff_out.label() = " << coeff_out.label() + << ", coeff_out.extent(0) = " << coeff_out.extent(0); + KokkosKernels::Impl::throw_runtime_exception(msg.str()); + } + } +} // populateRank1Stride1ViewWithScalarOrNonStrideView() + +} // namespace Impl +} // namespace KokkosBlas + +#endif // KOKKOS_BLAS1_AXPBY_UNIFICATION_ATTEMPT_TRAITS_HPP_ diff --git a/blas/impl/KokkosBlas2_gemv_impl.hpp b/blas/impl/KokkosBlas2_gemv_impl.hpp index 730f88602a..dc0f531583 100644 --- a/blas/impl/KokkosBlas2_gemv_impl.hpp +++ b/blas/impl/KokkosBlas2_gemv_impl.hpp @@ -199,10 +199,9 @@ struct SingleLevelTransposeGEMV { }; // Single-level parallel version of GEMV. 
-template -void singleLevelGemv(const typename AViewType::execution_space& space, - const char trans[], +template +void singleLevelGemv(const ExecutionSpace& space, const char trans[], typename AViewType::const_value_type& alpha, const AViewType& A, const XViewType& x, typename YViewType::const_value_type& beta, @@ -222,9 +221,8 @@ void singleLevelGemv(const typename AViewType::execution_space& space, static_assert(std::is_integral::value, "IndexType must be an integer"); - using y_value_type = typename YViewType::non_const_value_type; - using execution_space = typename AViewType::execution_space; - using policy_type = Kokkos::RangePolicy; + using y_value_type = typename YViewType::non_const_value_type; + using policy_type = Kokkos::RangePolicy; using AlphaCoeffType = typename AViewType::non_const_value_type; using BetaCoeffType = typename YViewType::non_const_value_type; @@ -442,8 +440,8 @@ struct TwoLevelGEMV_LayoutRightTag {}; // --------------------------------------------------------------------------------------------- // Functor for a two-level parallel_reduce version of GEMV (non-transpose), // designed for performance on GPU. Kernel depends on the layout of A. -template +template struct TwoLevelGEMV { using y_value_type = typename YViewType::non_const_value_type; using AlphaCoeffType = typename AViewType::non_const_value_type; @@ -453,9 +451,8 @@ struct TwoLevelGEMV { std::is_same::value, float, y_value_type>::type; - using execution_space = typename AViewType::execution_space; - using policy_type = Kokkos::TeamPolicy; - using member_type = typename policy_type::member_type; + using policy_type = Kokkos::TeamPolicy; + using member_type = typename policy_type::member_type; TwoLevelGEMV(const AlphaCoeffType& alpha, const AViewType& A, const XViewType& x, const BetaCoeffType& beta, @@ -564,7 +561,8 @@ struct TwoLevelGEMV { // transpose GEMV. 
The functor uses parallel-for over the columns of the input // matrix A and each team uses parallel-reduce over the row of its column. // The output vector y is the reduction result. -template struct TwoLevelTransposeGEMV { using y_value_type = typename YViewType::non_const_value_type; @@ -575,9 +573,8 @@ struct TwoLevelTransposeGEMV { std::is_same::value, float, y_value_type>::type; - using execution_space = typename AViewType::execution_space; - using policy_type = Kokkos::TeamPolicy; - using member_type = typename policy_type::member_type; + using policy_type = Kokkos::TeamPolicy; + using member_type = typename policy_type::member_type; TwoLevelTransposeGEMV(const AlphaCoeffType& alpha, const AViewType& A, const XViewType& x, const BetaCoeffType& beta, @@ -637,10 +634,9 @@ struct TwoLevelTransposeGEMV { }; // Two-level parallel version of GEMV. -template -void twoLevelGemv(const typename AViewType::execution_space& space, - const char trans[], +template +void twoLevelGemv(const ExecutionSpace& space, const char trans[], typename AViewType::const_value_type& alpha, const AViewType& A, const XViewType& x, typename YViewType::const_value_type& beta, @@ -661,9 +657,8 @@ void twoLevelGemv(const typename AViewType::execution_space& space, "IndexType must be an integer"); using y_value_type = typename YViewType::non_const_value_type; - using execution_space = typename AViewType::execution_space; - using team_policy_type = Kokkos::TeamPolicy; - using range_policy_type = Kokkos::RangePolicy; + using team_policy_type = Kokkos::TeamPolicy; + using range_policy_type = Kokkos::RangePolicy; using Kokkos::ArithTraits; using KAT = ArithTraits; @@ -704,19 +699,19 @@ void twoLevelGemv(const typename AViewType::execution_space& space, using layout_tag = typename std::conditional::type; - using tagged_policy = Kokkos::TeamPolicy; - using functor_type = - TwoLevelGEMV; + using tagged_policy = Kokkos::TeamPolicy; + using functor_type = TwoLevelGEMV; functor_type functor(alpha, A, x, 
beta, y); tagged_policy team; - if (isLayoutLeft) { + if constexpr (isLayoutLeft) { using AccumScalar = typename std::conditional< std::is_same::value || std::is_same::value, float, y_value_type>::type; size_t sharedPerTeam = 32 * sizeof(AccumScalar); IndexType numTeams = (A.extent(0) + 31) / 32; - tagged_policy temp(1, 1); + tagged_policy temp(space, 1, 1); temp.set_scratch_size(0, Kokkos::PerTeam(sharedPerTeam)); int teamSize = temp.team_size_recommended(functor, Kokkos::ParallelForTag()); @@ -727,7 +722,7 @@ void twoLevelGemv(const typename AViewType::execution_space& space, // FIXME SYCL: team_size_recommended() returns too big of a team size. // Kernel hangs with 1024 threads on XEHP. #ifdef KOKKOS_ENABLE_SYCL - if (std::is_same::value) { + if (std::is_same::value) { if (teamSize > 256) teamSize = 256; } #endif @@ -749,16 +744,18 @@ void twoLevelGemv(const typename AViewType::execution_space& space, } else if (tr == 'T') { // transpose, and not conj transpose team_policy_type team(space, A.extent(1), Kokkos::AUTO); - using functor_type = TwoLevelTransposeGEMV; + using functor_type = + TwoLevelTransposeGEMV; functor_type functor(alpha, A, x, beta, y); Kokkos::parallel_for("KokkosBlas::gemv[twoLevelTranspose]", team, functor); } else if (tr == 'C' || tr == 'H') { // conjugate transpose team_policy_type team(space, A.extent(1), Kokkos::AUTO); - using functor_type = TwoLevelTransposeGEMV; + using functor_type = + TwoLevelTransposeGEMV; functor_type functor(alpha, A, x, beta, y); Kokkos::parallel_for("KokkosBlas::gemv[twoLevelTranspose]", team, functor); @@ -769,11 +766,11 @@ void twoLevelGemv(const typename AViewType::execution_space& space, // generalGemv: use 1 level (Range) or 2 level (Team) implementation, // depending on whether execution space is CPU or GPU. enable_if makes sure // unused kernels are not instantiated. 
-template ()>::type* = nullptr> -void generalGemvImpl(const typename AViewType::execution_space& space, - const char trans[], + ExecutionSpace>()>::type* = nullptr> +void generalGemvImpl(const ExecutionSpace& space, const char trans[], typename AViewType::const_value_type& alpha, const AViewType& A, const XViewType& x, typename YViewType::const_value_type& beta, @@ -781,11 +778,11 @@ void generalGemvImpl(const typename AViewType::execution_space& space, singleLevelGemv(space, trans, alpha, A, x, beta, y); } -template ()>::type* = nullptr> -void generalGemvImpl(const typename AViewType::execution_space& space, - const char trans[], + ExecutionSpace>()>::type* = nullptr> +void generalGemvImpl(const ExecutionSpace& space, const char trans[], typename AViewType::const_value_type& alpha, const AViewType& A, const XViewType& x, typename YViewType::const_value_type& beta, diff --git a/blas/impl/KokkosBlas2_gemv_spec.hpp b/blas/impl/KokkosBlas2_gemv_spec.hpp index 08842a61c0..97e6e2717e 100644 --- a/blas/impl/KokkosBlas2_gemv_spec.hpp +++ b/blas/impl/KokkosBlas2_gemv_spec.hpp @@ -104,10 +104,10 @@ struct GEMV { // Prefer int as the index type, but use a larger type if needed. if (numRows < static_cast(INT_MAX) && numCols < static_cast(INT_MAX)) { - generalGemvImpl(space, trans, alpha, - A, x, beta, y); + generalGemvImpl( + space, trans, alpha, A, x, beta, y); } else { - generalGemvImpl( + generalGemvImpl( space, trans, alpha, A, x, beta, y); } Kokkos::Profiling::popRegion(); diff --git a/blas/impl/KokkosBlas2_syr2_impl.hpp b/blas/impl/KokkosBlas2_syr2_impl.hpp new file mode 100644 index 0000000000..69284e9547 --- /dev/null +++ b/blas/impl/KokkosBlas2_syr2_impl.hpp @@ -0,0 +1,369 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. 
Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS2_SYR2_IMPL_HPP_ +#define KOKKOSBLAS2_SYR2_IMPL_HPP_ + +#include "KokkosKernels_config.h" +#include "Kokkos_Core.hpp" +#include "KokkosKernels_ExecSpaceUtils.hpp" +#include "Kokkos_ArithTraits.hpp" + +namespace KokkosBlas { +namespace Impl { + +// Functor for the thread parallel version of SYR2. +// This functor parallelizes over rows of the input matrix A. +template +struct ThreadParallelSYR2 { + using AlphaCoeffType = typename AViewType::non_const_value_type; + using XComponentType = typename XViewType::non_const_value_type; + using YComponentType = typename YViewType::non_const_value_type; + using AComponentType = typename AViewType::non_const_value_type; + + ThreadParallelSYR2(const AlphaCoeffType& alpha, const XViewType& x, + const YViewType& y, const AViewType& A) + : alpha_(alpha), x_(x), y_(y), A_(A) { + // Nothing to do + } + + KOKKOS_INLINE_FUNCTION void operator()(const IndexType& i) const { + if (alpha_ == Kokkos::ArithTraits::zero()) { + // Nothing to do + } else if ((x_(i) == Kokkos::ArithTraits::zero()) && + (y_(i) == Kokkos::ArithTraits::zero())) { + // Nothing to do + } else { + const XComponentType x_fixed(x_(i)); + const YComponentType y_fixed(y_(i)); + const IndexType N(A_.extent(1)); + + if constexpr (tJustTranspose) { + if (x_fixed != Kokkos::ArithTraits::zero()) { + for (IndexType j = 0; j < N; ++j) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType(alpha_ * x_fixed * y_(j)); + } + } + } + if (y_fixed != Kokkos::ArithTraits::zero()) { + for (IndexType j = 0; j < N; ++j) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType(alpha_ * y_fixed * x_(j)); + 
} + } + } + } else { + if (x_fixed != Kokkos::ArithTraits::zero()) { + for (IndexType j = 0; j < N; ++j) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType( + alpha_ * x_fixed * + Kokkos::ArithTraits::conj(y_(j))); + } + } + } + if (y_fixed != Kokkos::ArithTraits::zero()) { + for (IndexType j = 0; j < N; ++j) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType( + Kokkos::ArithTraits::conj(alpha_) * y_fixed * + Kokkos::ArithTraits::conj(x_(j))); + } + } + } + } + } + } + + private: + AlphaCoeffType alpha_; + typename XViewType::const_type x_; + typename YViewType::const_type y_; + AViewType A_; +}; + +// Thread parallel version of SYR2. +template +void threadParallelSyr2(const ExecutionSpace& space, + const typename AViewType::const_value_type& alpha, + const XViewType& x, const YViewType& y, + const AViewType& A) { + static_assert(std::is_integral::value, + "IndexType must be an integer"); + + using AlphaCoeffType = typename AViewType::non_const_value_type; + + if (x.extent(0) == 0) { + // no entries to update + } else if (y.extent(0) == 0) { + // no entries to update + } else if (alpha == Kokkos::ArithTraits::zero()) { + // no entries to update + } else { + Kokkos::RangePolicy rangePolicy(space, 0, + A.extent(0)); + ThreadParallelSYR2 + functor(alpha, x, y, A); + Kokkos::parallel_for("KokkosBlas::syr2[threadParallel]", rangePolicy, + functor); + } +} + +struct TeamParallelSYR2_LayoutLeftTag {}; +struct TeamParallelSYR2_LayoutRightTag {}; + +// --------------------------------------------------------------------------------------------- + +// Functor for the team parallel version of SYR2, designed for +// performance on GPUs. The kernel depends on the layout of A. 
+template +struct TeamParallelSYR2 { + using AlphaCoeffType = typename AViewType::non_const_value_type; + using XComponentType = typename XViewType::non_const_value_type; + using YComponentType = typename YViewType::non_const_value_type; + using AComponentType = typename AViewType::non_const_value_type; + + using policy_type = Kokkos::TeamPolicy; + using member_type = typename policy_type::member_type; + + TeamParallelSYR2(const AlphaCoeffType& alpha, const XViewType& x, + const YViewType& y, const AViewType& A) + : alpha_(alpha), x_(x), y_(y), A_(A) { + // Nothing to do + } + + public: + // LayoutLeft version: one team per column + KOKKOS_INLINE_FUNCTION void operator()(TeamParallelSYR2_LayoutLeftTag, + const member_type& team) const { + if (alpha_ == Kokkos::ArithTraits::zero()) { + // Nothing to do + } else { + const IndexType j(team.league_rank()); + if ((x_(j) == Kokkos::ArithTraits::zero()) && + (y_(j) == Kokkos::ArithTraits::zero())) { + // Nothing to do + } else { + const IndexType M(A_.extent(0)); + if constexpr (tJustTranspose) { + const XComponentType x_fixed(x_(j)); + const YComponentType y_fixed(y_(j)); + if (y_fixed != Kokkos::ArithTraits::zero()) { + Kokkos::parallel_for( + Kokkos::TeamThreadRange(team, M), [&](const IndexType& i) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType(alpha_ * x_(i) * y_fixed); + } + }); + } + if (x_fixed != Kokkos::ArithTraits::zero()) { + Kokkos::parallel_for( + Kokkos::TeamThreadRange(team, M), [&](const IndexType& i) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType(alpha_ * y_(i) * x_fixed); + } + }); + } + } else { + const XComponentType x_fixed( + Kokkos::ArithTraits::conj(x_(j))); + const YComponentType y_fixed( + Kokkos::ArithTraits::conj(y_(j))); + if (y_fixed != Kokkos::ArithTraits::zero()) { + Kokkos::parallel_for( + Kokkos::TeamThreadRange(team, M), [&](const IndexType& i) { + if (((tJustUp 
== true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType(alpha_ * x_(i) * y_fixed); + } + }); + } + if (x_fixed != Kokkos::ArithTraits::zero()) { + Kokkos::parallel_for( + Kokkos::TeamThreadRange(team, M), [&](const IndexType& i) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType( + Kokkos::ArithTraits::conj(alpha_) * + y_(i) * x_fixed); + } + }); + } + } + } + } + } + + // LayoutRight version: one team per row + KOKKOS_INLINE_FUNCTION void operator()(TeamParallelSYR2_LayoutRightTag, + const member_type& team) const { + if (alpha_ == Kokkos::ArithTraits::zero()) { + // Nothing to do + } else { + const IndexType i(team.league_rank()); + if ((x_(i) == Kokkos::ArithTraits::zero()) && + (y_(i) == Kokkos::ArithTraits::zero())) { + // Nothing to do + } else { + const IndexType N(A_.extent(1)); + const XComponentType x_fixed(x_(i)); + const YComponentType y_fixed(y_(i)); + if constexpr (tJustTranspose) { + if (x_fixed != Kokkos::ArithTraits::zero()) { + Kokkos::parallel_for( + Kokkos::TeamThreadRange(team, N), [&](const IndexType& j) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType(alpha_ * x_fixed * y_(j)); + } + }); + } + if (y_fixed != Kokkos::ArithTraits::zero()) { + Kokkos::parallel_for( + Kokkos::TeamThreadRange(team, N), [&](const IndexType& j) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType(alpha_ * y_fixed * x_(j)); + } + }); + } + } else { + if (x_fixed != Kokkos::ArithTraits::zero()) { + Kokkos::parallel_for( + Kokkos::TeamThreadRange(team, N), [&](const IndexType& j) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType( + alpha_ * x_fixed * + Kokkos::ArithTraits::conj(y_(j))); + } + }); + } + if (y_fixed != Kokkos::ArithTraits::zero()) { + Kokkos::parallel_for( + 
Kokkos::TeamThreadRange(team, N), [&](const IndexType& j) { + if (((tJustUp == true) && (i <= j)) || + ((tJustUp == false) && (i >= j))) { + A_(i, j) += AComponentType( + Kokkos::ArithTraits::conj(alpha_) * + y_fixed * + Kokkos::ArithTraits::conj(x_(j))); + } + }); + } + } + } + } + } + + private: + AlphaCoeffType alpha_; + typename XViewType::const_type x_; + typename YViewType::const_type y_; + AViewType A_; +}; + +// Team parallel version of SYR2. +template +void teamParallelSyr2(const ExecutionSpace& space, + const typename AViewType::const_value_type& alpha, + const XViewType& x, const YViewType& y, + const AViewType& A) { + static_assert(std::is_integral::value, + "IndexType must be an integer"); + + using AlphaCoeffType = typename AViewType::non_const_value_type; + + if (x.extent(0) == 0) { + // no entries to update + return; + } else if (y.extent(0) == 0) { + // no entries to update + return; + } else if (alpha == Kokkos::ArithTraits::zero()) { + // no entries to update + return; + } + + constexpr bool isLayoutLeft = + std::is_same::value; + using layout_tag = + typename std::conditional::type; + using TeamPolicyType = Kokkos::TeamPolicy; + TeamPolicyType teamPolicy; + if (isLayoutLeft) { + // LayoutLeft: one team per column + teamPolicy = TeamPolicyType(space, A.extent(1), Kokkos::AUTO); + } else { + // LayoutRight: one team per row + teamPolicy = TeamPolicyType(space, A.extent(0), Kokkos::AUTO); + } + + TeamParallelSYR2 + functor(alpha, x, y, A); + Kokkos::parallel_for("KokkosBlas::syr2[teamParallel]", teamPolicy, functor); +} + +// --------------------------------------------------------------------------------------------- + +// generalSyr2Impl(): +// - use thread parallel code (rangePolicy) if execution space is CPU; +// - use team parallel code (teamPolicy) if execution space is GPU. +// +// The 'enable_if' makes sure unused kernels are not instantiated. 
+ +template ()>::type* = nullptr> +void generalSyr2Impl(const ExecutionSpace& space, + const typename AViewType::const_value_type& alpha, + const XViewType& x, const YViewType& y, + const AViewType& A) { + threadParallelSyr2(space, alpha, x, y, A); +} + +template ()>::type* = nullptr> +void generalSyr2Impl(const ExecutionSpace& space, + const typename AViewType::const_value_type& alpha, + const XViewType& x, const YViewType& y, + const AViewType& A) { + teamParallelSyr2(space, alpha, x, y, A); +} + +} // namespace Impl +} // namespace KokkosBlas + +#endif // KOKKOSBLAS2_SYR2_IMPL_HPP_ diff --git a/blas/impl/KokkosBlas2_syr2_spec.hpp b/blas/impl/KokkosBlas2_syr2_spec.hpp new file mode 100644 index 0000000000..01637ba1d4 --- /dev/null +++ b/blas/impl/KokkosBlas2_syr2_spec.hpp @@ -0,0 +1,180 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS2_SYR2_SPEC_HPP_ +#define KOKKOSBLAS2_SYR2_SPEC_HPP_ + +#include "KokkosKernels_config.h" +#include "Kokkos_Core.hpp" + +#if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY +#include +#endif + +namespace KokkosBlas { +namespace Impl { +// Specialization struct which defines whether a specialization exists +template +struct syr2_eti_spec_avail { + enum : bool { value = false }; +}; +} // namespace Impl +} // namespace KokkosBlas + +// +// Macro for declaration of full specialization availability +// KokkosBlas::Impl::SYR2. This is NOT for users!!! 
All the declarations of full +// specializations go in this header file. We may spread out definitions (see +// _INST macro below) across one or more .cpp files. +// +#define KOKKOSBLAS2_SYR2_ETI_SPEC_AVAIL(SCALAR, LAYOUT, EXEC_SPACE, MEM_SPACE) \ + template <> \ + struct syr2_eti_spec_avail< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ + }; + +// Include the actual specialization declarations +#include +#include + +namespace KokkosBlas { +namespace Impl { + +// +// syr2 +// + +// Implementation of KokkosBlas::syr2. +template < + class ExecutionSpace, class XViewType, class YViewType, class AViewType, + bool tpl_spec_avail = syr2_tpl_spec_avail::value, + bool eti_spec_avail = syr2_eti_spec_avail::value> +struct SYR2 { + static void syr2(const ExecutionSpace& space, const char trans[], + const char uplo[], + const typename AViewType::const_value_type& alpha, + const XViewType& x, const YViewType& y, const AViewType& A) +#if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY + { + Kokkos::Profiling::pushRegion(KOKKOSKERNELS_IMPL_COMPILE_LIBRARY + ? "KokkosBlas::syr2[ETI]" + : "KokkosBlas::syr2[noETI]"); + + typedef typename AViewType::size_type size_type; + const size_type numRows = A.extent(0); + const size_type numCols = A.extent(1); + + bool justTranspose = (trans[0] == 'T') || (trans[0] == 't'); + bool justUp = (uplo[0] == 'U') || (uplo[0] == 'u'); + + // Prefer int as the index type, but use a larger type if needed.
+ if ((numRows < static_cast(INT_MAX)) && + (numCols < static_cast(INT_MAX))) { + if (justTranspose) { + if (justUp) { + generalSyr2Impl(space, alpha, x, y, A); + } else { + generalSyr2Impl(space, alpha, x, y, A); + } + } else { + if (justUp) { + generalSyr2Impl(space, alpha, x, y, A); + } else { + generalSyr2Impl(space, alpha, x, y, A); + } + } + } else { + if (justTranspose) { + if (justUp) { + generalSyr2Impl(space, alpha, x, y, A); + } else { + generalSyr2Impl(space, alpha, x, y, A); + } + } else { + if (justUp) { + generalSyr2Impl(space, alpha, x, y, A); + } else { + generalSyr2Impl(space, alpha, x, y, A); + } + } + } + + Kokkos::Profiling::popRegion(); + } +#else + ; +#endif // if !defined(KOKKOSKERNELS_ETI_ONLY) || + // KOKKOSKERNELS_IMPL_COMPILE_LIBRARY +}; + +} // namespace Impl +} // namespace KokkosBlas + +// +// Macro for declaration of full specialization of KokkosBlas::Impl::SYR2. +// This is NOT for users!!! +// All the declarations of full specializations go in this header file. +// We may spread out definitions (see _DEF macro below) across one or more .cpp +// files. 
+// +#define KOKKOSBLAS2_SYR2_ETI_SPEC_DECL(SCALAR, LAYOUT, EXEC_SPACE, MEM_SPACE) \ + extern template struct SYR2< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + false, true>; + +#define KOKKOSBLAS2_SYR2_ETI_SPEC_INST(SCALAR, LAYOUT, EXEC_SPACE, MEM_SPACE) \ + template struct SYR2< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + false, true>; + +#include + +#endif // KOKKOSBLAS2_SYR2_SPEC_HPP_ diff --git a/blas/impl/KokkosBlas2_syr_impl.hpp b/blas/impl/KokkosBlas2_syr_impl.hpp index 439ed588db..685ca75997 100644 --- a/blas/impl/KokkosBlas2_syr_impl.hpp +++ b/blas/impl/KokkosBlas2_syr_impl.hpp @@ -94,7 +94,7 @@ void threadParallelSyr(const ExecutionSpace& space, A.extent(0)); ThreadParallelSYR functor(alpha, x, A); - Kokkos::parallel_for("KokkosBlas::syr[thredParallel]", rangePolicy, + Kokkos::parallel_for("KokkosBlas::syr[threadParallel]", rangePolicy, functor); } } diff --git a/blas/src/KokkosBlas1_axpby.hpp b/blas/src/KokkosBlas1_axpby.hpp index 2f59cb4cce..5cd03dd7c7 100644 --- a/blas/src/KokkosBlas1_axpby.hpp +++ b/blas/src/KokkosBlas1_axpby.hpp @@ -17,124 +17,262 @@ #ifndef KOKKOSBLAS1_AXPBY_HPP_ #define KOKKOSBLAS1_AXPBY_HPP_ +#if (KOKKOSKERNELS_DEBUG_LEVEL > 0) +#include +#endif // KOKKOSKERNELS_DEBUG_LEVEL + #include #include #include #include +#include // axpby() accepts both scalar coefficients a and b, and vector // coefficients (apply one for each column of the input multivectors). // This traits class helps axpby() select the correct specialization -// of AV and BV (the type of a resp. b) for invoking the +// of AV (type of 'a') and BV (type of 'b') for invoking the // implementation. namespace KokkosBlas { /// \brief Computes Y := a*X + b*Y /// -/// This function is non-blocking and thread safe. 
+/// This function is non-blocking and thread-safe. /// -/// \tparam execution_space a Kokkos execution space where the kernel will run. -/// \tparam AV 1-D or 2-D Kokkos::View specialization. -/// \tparam XMV 1-D or 2-D Kokkos::View specialization. -/// \tparam BV 1-D or 2-D Kokkos::View specialization. -/// \tparam YMV 1-D or 2-D Kokkos::View specialization. It must have -/// the same rank as XMV. +/// \tparam execution_space The type of execution space where the kernel +/// will run. +/// \tparam AV Scalar or 0-D or 1-D Kokkos::View. +/// \tparam XMV 1-D Kokkos::View or 2-D Kokkos::View. It +/// must have the same rank as YMV. +/// \tparam BV Scalar or 0-D or 1-D Kokkos::View. +/// \tparam YMV 1-D or 2-D Kokkos::View. /// -/// \param space [in] the execution space instance on which the kernel will run. -/// \param a [in] view of type AV, scaling parameter for X. -/// \param X [in] input view of type XMV. -/// \param b [in] view of type BV, scaling parameter for Y. -/// \param Y [in/out] view of type YMV in which the results will be stored. +/// \param exec_space [in] The execution space instance on which the kernel +/// will run. +/// \param a [in] Input of type AV: +/// - scaling parameter for 1-D or 2-D X, +/// - scaling parameters for 2-D X. +/// \param X [in] View of type XMV. It must have the same +/// extent(s) as Y. +/// \param b [in] input of type BV: +/// - scaling parameter for 1-D or 2-D Y, +/// - scaling parameters for 2-D Y. +/// \param Y [in/out] View of type YMV in which the results will be +/// stored. 
template -void axpby(const execution_space& space, const AV& a, const XMV& X, const BV& b, - const YMV& Y) { - static_assert(Kokkos::is_execution_space_v, - "KokkosBlas::axpby: execution_space must be a valid Kokkos " - "execution space."); - static_assert(Kokkos::is_view::value, - "KokkosBlas::axpby: " - "X is not a Kokkos::View."); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::axpby: XMV must be accessible from execution_space"); - static_assert(Kokkos::is_view::value, - "KokkosBlas::axpby: " - "Y is not a Kokkos::View."); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::axpby: XMV must be accessible from execution_space"); - static_assert(std::is_same::value, - "KokkosBlas::axpby: Y is const. It must be nonconst, " - "because it is an output argument " - "(we must be able to write to its entries)."); - static_assert(int(YMV::rank) == int(XMV::rank), - "KokkosBlas::axpby: " - "X and Y must have the same rank."); - static_assert(YMV::rank == 1 || YMV::rank == 2, - "KokkosBlas::axpby: " - "XMV and YMV must either have rank 1 or rank 2."); - - // Check compatibility of dimensions at run time. 
- if (X.extent(0) != Y.extent(0) || X.extent(1) != Y.extent(1)) { - std::ostringstream os; - os << "KokkosBlas::axpby: Dimensions of X and Y do not match: " - << "X: " << X.extent(0) << " x " << X.extent(1) << ", Y: " << Y.extent(0) - << " x " << Y.extent(1); - KokkosKernels::Impl::throw_runtime_exception(os.str()); - } +void axpby(const execution_space& exec_space, const AV& a, const XMV& X, + const BV& b, const YMV& Y) { + using AxpbyTraits = + Impl::AxpbyUnificationAttemptTraits; + using InternalTypeA = typename AxpbyTraits::InternalTypeA; + using InternalTypeX = typename AxpbyTraits::InternalTypeX; + using InternalTypeB = typename AxpbyTraits::InternalTypeB; + using InternalTypeY = typename AxpbyTraits::InternalTypeY; + + // ********************************************************************** + // Perform compile time checks and run time checks. + // ********************************************************************** + AxpbyTraits::performChecks(a, X, b, Y); +#if (KOKKOSKERNELS_DEBUG_LEVEL > 1) + AxpbyTraits::printInformation(std::cout, "axpby(), unif information"); +#endif // KOKKOSKERNELS_DEBUG_LEVEL + + // ********************************************************************** + // Call Impl::Axpby<...>::axpby(...) + // ********************************************************************** + InternalTypeX internal_X = X; + InternalTypeY internal_Y = Y; + + if constexpr (AxpbyTraits::internalTypesAB_bothScalars) { + // ******************************************************************** + // The unification logic applies the following general rules: + // 1) In a 'onHost' case, it makes the internal types for 'a' and 'b' + // to be both scalars (hence the name 'internalTypesAB_bothScalars') + // 2) In a 'onDevice' case, it makes the internal types for 'a' and 'b' + // to be Kokkos views. 
For performance reasons in Trilinos, the only + // exception for this rule is when the input types for both 'a' and + // 'b' are already scalars, in which case the internal types for 'a' + // and 'b' become scalars as well, eventually changing precision in + // order to match the precisions of 'X' and 'Y'. + // ******************************************************************** + if constexpr (AxpbyTraits::a_is_scalar && AxpbyTraits::b_is_scalar && + AxpbyTraits::onDevice) { + // ****************************************************************** + // We are in the exception situation for rule 2 + // ****************************************************************** + InternalTypeA internal_a(a); + InternalTypeA internal_b(b); - using UnifiedXLayout = - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout; - using UnifiedYLayout = - typename KokkosKernels::Impl::GetUnifiedLayoutPreferring< - YMV, UnifiedXLayout>::array_layout; - - // Create unmanaged versions of the input Views. XMV and YMV may be - // rank 1 or rank 2. AV and BV may be either rank-1 Views, or - // scalar values. 
- using XMV_Internal = Kokkos::View >; - using YMV_Internal = Kokkos::View >; - using AV_Internal = - typename KokkosKernels::Impl::GetUnifiedScalarViewType::type; - using BV_Internal = - typename KokkosKernels::Impl::GetUnifiedScalarViewType::type; - - AV_Internal a_internal = a; - XMV_Internal X_internal = X; - BV_Internal b_internal = b; - YMV_Internal Y_internal = Y; - - Impl::Axpby::axpby(space, a_internal, X_internal, b_internal, - Y_internal); + Impl::Axpby::axpby(exec_space, internal_a, internal_X, + internal_b, internal_Y); + } else { + // ****************************************************************** + // We are in rule 1, that is, we are in a 'onHost' case now + // ****************************************************************** + InternalTypeA internal_a(Impl::getScalarValueFromVariableAtHost< + AV, Impl::typeRank()>::getValue(a)); + InternalTypeB internal_b(Impl::getScalarValueFromVariableAtHost< + BV, Impl::typeRank()>::getValue(b)); + + Impl::Axpby::axpby(exec_space, internal_a, internal_X, + internal_b, internal_Y); + } + } else if constexpr (AxpbyTraits::internalTypesAB_bothViews) { + constexpr bool internalLayoutA_isStride( + std::is_same_v); + constexpr bool internalLayoutB_isStride( + std::is_same_v); + + const size_t numScalarsA(Impl::getAmountOfScalarsInCoefficient(a)); + const size_t numScalarsB(Impl::getAmountOfScalarsInCoefficient(b)); + + const size_t strideA(Impl::getStrideInCoefficient(a)); + const size_t strideB(Impl::getStrideInCoefficient(b)); + + Kokkos::LayoutStride layoutStrideA{numScalarsA, strideA}; + Kokkos::LayoutStride layoutStrideB{numScalarsB, strideB}; + + InternalTypeA internal_a; + InternalTypeB internal_b; + + if constexpr (internalLayoutA_isStride) { + // ****************************************************************** + // Prepare internal_a + // ****************************************************************** + typename AxpbyTraits::InternalTypeA_managed managed_a("managed_a", + layoutStrideA); + if 
constexpr (AxpbyTraits::atInputLayoutA_isStride) { + Kokkos::deep_copy(managed_a, a); + } else { + Impl::populateRank1Stride1ViewWithScalarOrNonStrideView(a, managed_a); + } + internal_a = managed_a; + + if constexpr (internalLayoutB_isStride) { + // **************************************************************** + // Prepare internal_b + // **************************************************************** + typename AxpbyTraits::InternalTypeB_managed managed_b("managed_b", + layoutStrideB); + if constexpr (AxpbyTraits::atInputLayoutB_isStride) { + Kokkos::deep_copy(managed_b, b); + } else { + Impl::populateRank1Stride1ViewWithScalarOrNonStrideView(b, managed_b); + } + internal_b = managed_b; + + // **************************************************************** + // Call Impl::Axpby<...>::axpby(...) + // **************************************************************** + Impl::Axpby::axpby(exec_space, internal_a, + internal_X, internal_b, + internal_Y); + } else { + // **************************************************************** + // Prepare internal_b + // **************************************************************** + typename AxpbyTraits::InternalTypeB_managed managed_b("managed_b", + numScalarsB); + if constexpr (AxpbyTraits::atInputLayoutB_isStride) { + Kokkos::deep_copy(managed_b, b); + } else { + Impl::populateRank1Stride1ViewWithScalarOrNonStrideView(b, managed_b); + } + internal_b = managed_b; + + // **************************************************************** + // Call Impl::Axpby<...>::axpby(...) 
+ // **************************************************************** + Impl::Axpby::axpby(exec_space, internal_a, + internal_X, internal_b, + internal_Y); + } + } else { + // ****************************************************************** + // Prepare internal_a + // ****************************************************************** + typename AxpbyTraits::InternalTypeA_managed managed_a("managed_a", + numScalarsA); + if constexpr (AxpbyTraits::atInputLayoutA_isStride) { + Kokkos::deep_copy(managed_a, a); + } else { + Impl::populateRank1Stride1ViewWithScalarOrNonStrideView(a, managed_a); + } + internal_a = managed_a; + + if constexpr (internalLayoutB_isStride) { + // **************************************************************** + // Prepare internal_b + // **************************************************************** + typename AxpbyTraits::InternalTypeB_managed managed_b("managed_b", + layoutStrideB); + if constexpr (AxpbyTraits::atInputLayoutB_isStride) { + Kokkos::deep_copy(managed_b, b); + } else { + Impl::populateRank1Stride1ViewWithScalarOrNonStrideView(b, managed_b); + } + internal_b = managed_b; + + // **************************************************************** + // Call Impl::Axpby<...>::axpby(...) + // **************************************************************** + Impl::Axpby::axpby(exec_space, internal_a, + internal_X, internal_b, + internal_Y); + } else { + // **************************************************************** + // Prepare internal_b + // **************************************************************** + typename AxpbyTraits::InternalTypeB_managed managed_b("managed_b", + numScalarsB); + if constexpr (AxpbyTraits::atInputLayoutB_isStride) { + Kokkos::deep_copy(managed_b, b); + } else { + Impl::populateRank1Stride1ViewWithScalarOrNonStrideView(b, managed_b); + } + internal_b = managed_b; + + // **************************************************************** + // Call Impl::Axpby<...>::axpby(...) 
+ // **************************************************************** + Impl::Axpby::axpby(exec_space, internal_a, + internal_X, internal_b, + internal_Y); + } + } + } } /// \brief Computes Y := a*X + b*Y /// -/// This function is non-blocking and thread-safe +/// This function is non-blocking and thread-safe. /// The kernel is executed in the default stream/queue /// associated with the execution space of XMV. /// -/// \tparam AV 1-D or 2-D Kokkos::View specialization. -/// \tparam XMV 1-D or 2-D Kokkos::View specialization. -/// \tparam BV 1-D or 2-D Kokkos::View specialization. -/// \tparam YMV 1-D or 2-D Kokkos::View specialization. It must have -/// the same rank as XMV. +/// \tparam AV Scalar or 0-D Kokkos::View or 1-D Kokkos::View. +/// \tparam XMV 1-D Kokkos::View or 2-D Kokkos::View. It must +/// have the same rank as YMV. +/// \tparam BV Scalar or 0-D Kokkos::View or 1-D Kokkos::View. +/// \tparam YMV 1-D Kokkos::View or 2-D Kokkos::View. /// -/// \param a [in] view of type AV, scaling parameter for X. -/// \param X [in] input view of type XMV. -/// \param b [in] view of type BV, scaling parameter for Y. -/// \param Y [in/out] view of type YMV in which the results will be stored. +/// \param a [in] Input of type AV: +/// - scaling parameter for 1-D or 2-D X, +/// - scaling parameters for 2-D X. +/// \param X [in] View of type XMV. It must have the same +/// extent(s) as Y. +/// \param b [in] input of type BV: +/// - scaling parameter for 1-D or 2-D Y, +/// - scaling parameters for 2-D Y. +/// \param Y [in/out] View of type YMV in which the results will be +/// stored. template void axpby(const AV& a, const XMV& X, const BV& b, const YMV& Y) { axpby(typename XMV::execution_space{}, a, X, b, Y); @@ -142,39 +280,49 @@ void axpby(const AV& a, const XMV& X, const BV& b, const YMV& Y) { /// \brief Computes Y := a*X + Y /// -/// This function is non-blocking and thread-safe +/// This function is non-blocking and thread-safe. 
/// -/// \tparam execution_space a Kokkos execution space where the kernel will run. -/// \tparam AV 1-D or 2-D Kokkos::View specialization. -/// \tparam XMV 1-D or 2-D Kokkos::View specialization. -/// \tparam YMV 1-D or 2-D Kokkos::View specialization. It must have -/// the same rank as XMV. +/// \tparam execution_space The type of execution space where the kernel +/// will run. +/// \tparam AV Scalar or 0-D or 1-D Kokkos::View. +/// \tparam XMV 1-D or 2-D Kokkos::View. It must have +/// the same rank as YMV. +/// \tparam YMV 1-D or 2-D Kokkos::View. /// -/// \param space [in] the execution space instance on which the kernel will run. -/// \param a [in] view of type AV, scaling parameter for X. -/// \param X [in] input view of type XMV. -/// \param Y [in/out] view of type YMV in which the results will be stored. +/// \param exec_space [in] The execution space instance on which the kernel +/// will run. +/// \param a [in] Input of type AV: +/// - scaling parameter for 1-D or 2-D X, +/// - scaling parameters for 2-D X. +/// \param X [in] View of type XMV. It must have the same +/// extent(s) as Y. +/// \param Y [in/out] View of type YMV in which the results will be +/// stored. template -void axpy(const execution_space& space, const AV& a, const XMV& X, +void axpy(const execution_space& exec_space, const AV& a, const XMV& X, const YMV& Y) { - axpby(space, a, X, + axpby(exec_space, a, X, Kokkos::ArithTraits::one(), Y); } /// \brief Computes Y := a*X + Y /// -/// This function is non-blocking and thread-safe +/// This function is non-blocking and thread-safe. /// The kernel is executed in the default stream/queue /// associated with the execution space of XMV. /// -/// \tparam AV 1-D or 2-D Kokkos::View specialization. -/// \tparam XMV 1-D or 2-D Kokkos::View specialization. -/// \tparam YMV 1-D or 2-D Kokkos::View specialization. It must have -/// the same rank as XMV. +/// \tparam AV Scalar or 0-D Kokkos::View or 1-D Kokkos::View.
+/// \tparam XMV 1-D Kokkos::View or 2-D Kokkos::View. It must +/// have the same rank as YMV. +/// \tparam YMV 1-D Kokkos::View or 2-D Kokkos::View. /// -/// \param a [in] view of type AV, scaling parameter for X. -/// \param X [in] input view of type XMV. -/// \param Y [in/out] view of type YMV in which the results will be stored. +/// \param a [in] Input of type AV: +/// - scaling parameter for 1-D or 2-D X, +/// - scaling parameters for 2-D X. +/// \param X [in] View of type XMV. It must have the same +/// extent(s) as Y. +/// \param Y [in/out] View of type YMV in which the results will be +/// stored. template void axpy(const AV& a, const XMV& X, const YMV& Y) { axpy(typename XMV::execution_space{}, a, X, Y); diff --git a/blas/src/KokkosBlas1_dot.hpp b/blas/src/KokkosBlas1_dot.hpp index ebccce7d7c..aa995836eb 100644 --- a/blas/src/KokkosBlas1_dot.hpp +++ b/blas/src/KokkosBlas1_dot.hpp @@ -96,25 +96,37 @@ dot(const execution_space& space, const XVector& x, const YVector& y) { Kokkos::View>; - result_type result{}; - RVector_Result R = RVector_Result(&result); XVector_Internal X = x; YVector_Internal Y = y; - // Even though RVector is the template parameter, Dot::dot has an overload - // that accepts RVector_Internal (with the special accumulator, if dot_type is - // 32-bit precision). Impl::Dot needs to support both cases, and it's easier - // to do this with overloading than by extending the ETI to deal with two - // different scalar types. - Impl::DotSpecialAccumulator::dot(space, R, - X, Y); - space.fence(); - // mfh 22 Jan 2020: We need the line below because - // Kokkos::complex lacks a constructor that takes a - // Kokkos::complex with U != T. - return Kokkos::Details::CastPossiblyComplex::cast( - result); + bool useFallback = false; + if (useFallback) { + // Even though RVector is the template parameter, Dot::dot has an overload + // that accepts RVector_Internal (with the special accumulator, if dot_type + // is 32-bit precision). 
Impl::Dot needs to support both cases, and it's + // easier to do this with overloading than by extending the ETI to deal with + // two different scalar types. + result_type result{}; + RVector_Result R = RVector_Result(&result); + Impl::DotSpecialAccumulator::dot(space, + R, X, + Y); + space.fence(); + // mfh 22 Jan 2020: We need the line below because + // Kokkos::complex lacks a constructor that takes a + // Kokkos::complex with U != T. + return Kokkos::Details::CastPossiblyComplex::cast( + result); + } else { + dot_type result{}; + RVector_Internal R = RVector_Internal(&result); + Impl::Dot::dot(space, R, X, Y); + space.fence(); + return Kokkos::Details::CastPossiblyComplex::cast( + result); + } } /// \brief Return the dot product of the two vectors x and y. diff --git a/blas/src/KokkosBlas1_swap.hpp b/blas/src/KokkosBlas1_swap.hpp index 26c529f3b7..9ddcd106df 100644 --- a/blas/src/KokkosBlas1_swap.hpp +++ b/blas/src/KokkosBlas1_swap.hpp @@ -26,12 +26,12 @@ namespace KokkosBlas { /// \brief Swaps the entries of vectors x and y. /// /// \tparam execution_space an execution space to perform parallel work -/// \tparam XVector Type of the first vector x; a 1-D Kokkos::View. -/// \tparam YVector Type of the first vector y; a 1-D Kokkos::View. +/// \tparam XVector Type of the first vector x; a rank 1 Kokkos::View. +/// \tparam YVector Type of the second vector y; a rank 1 Kokkos::View. /// /// \param space [in] execution space passed to execution policies -/// \param x [in/out] 1-D View. -/// \param y [in/out] 1-D View. +/// \param x [in/out] rank 1 View. +/// \param y [in/out] rank 1 View. /// /// Swaps x and y. Note that this is akin to performing a deep_copy, swapping /// pointers inside view can only be performed if no aliasing, subviews, etc... @@ -100,11 +100,11 @@ void swap(execution_space const& space, XVector const& x, YVector const& y) { /// \brief Swaps the entries of vectors x and y. /// -/// \tparam XVector Type of the first vector x; a 1-D Kokkos::View.
-/// \tparam YVector Type of the first vector y; a 1-D Kokkos::View. +/// \tparam XVector Type of the first vector x; a rank 1 Kokkos::View. +/// \tparam YVector Type of the first vector y; a rank 1 Kokkos::View. /// -/// \param x [in/out] 1-D View. -/// \param y [in/out] 1-D View. +/// \param x [in/out] rank 1 View. +/// \param y [in/out] rank 1 View. /// /// This function is non-blocking unless the underlying TPL requested /// at compile time is itself blocking. Note that the kernel will be diff --git a/blas/src/KokkosBlas2_ger.hpp b/blas/src/KokkosBlas2_ger.hpp index fbfc9c1f98..8650577faf 100644 --- a/blas/src/KokkosBlas2_ger.hpp +++ b/blas/src/KokkosBlas2_ger.hpp @@ -17,6 +17,8 @@ #ifndef KOKKOSBLAS2_GER_HPP_ #define KOKKOSBLAS2_GER_HPP_ +#include "KokkosKernels_helpers.hpp" + #include namespace KokkosBlas { @@ -42,15 +44,6 @@ template ::assignable, - "AViewType memory space must be assignable from XViewType"); - static_assert( - Kokkos::SpaceAccessibility::assignable, - "AViewType memory space must be assignable from YViewType"); - static_assert( Kokkos::SpaceAccessibility::accessible, diff --git a/blas/src/KokkosBlas2_syr.hpp b/blas/src/KokkosBlas2_syr.hpp index af66767ab4..00d1d8b3de 100644 --- a/blas/src/KokkosBlas2_syr.hpp +++ b/blas/src/KokkosBlas2_syr.hpp @@ -17,6 +17,8 @@ #ifndef KOKKOSBLAS2_SYR_HPP_ #define KOKKOSBLAS2_SYR_HPP_ +#include "KokkosKernels_helpers.hpp" + #include namespace KokkosBlas { @@ -64,11 +66,6 @@ template void syr(const ExecutionSpace& space, const char trans[], const char uplo[], const typename AViewType::const_value_type& alpha, const XViewType& x, const AViewType& A) { - static_assert( - Kokkos::SpaceAccessibility::assignable, - "AViewType memory space must be assignable from XViewType"); - static_assert( Kokkos::SpaceAccessibility::accessible, diff --git a/blas/src/KokkosBlas2_syr2.hpp b/blas/src/KokkosBlas2_syr2.hpp new file mode 100644 index 0000000000..d86abd31c1 --- /dev/null +++ b/blas/src/KokkosBlas2_syr2.hpp @@ -0,0 
+1,238 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS2_SYR2_HPP_ +#define KOKKOSBLAS2_SYR2_HPP_ + +#include "KokkosKernels_helpers.hpp" + +#include +#include + +namespace KokkosBlas { + +/// \brief Rank-2 update (just lower portion or just upper portion) of a +/// matrix A that is: +/// - symmetric, A += alpha * x * y^T + alpha * y * x^T, or +/// - Hermitian, A += alpha * x * y^H + conj(alpha) * y * x^H. +/// +/// Important note 1: this routine encapsulates the syr2() and her2() +/// routines specified in the BLAS documentation. It has the purpose of +/// updating a symmetric (or Hermitian) matrix A in such a way that +/// it continues to be symmetric (or Hermitian). +/// +/// Important note 2: however, this routine will honor all parameters +/// passed to it, even if A is not symmetric or not Hermitian. +/// Moreover, this routine will always compute either the lower +/// portion or the upper portion (per user's request) of the final +/// matrix A. So, in order to obtain meaningful results, the user +/// must make sure to follow the conditions specified in the +/// "important note 1" above. +/// +/// Important note 3: if TPL is enabled, this routine will call the +/// third party library BLAS routines whenever the parameters passed +/// are consistent with the parameters expected by the corresponding +/// TPL routine. 
If not, then this routine will route the execution +/// to the kokkos-kernels implementation, thus honoring all +/// parameters passed, as stated in the "important note 2" above. +/// +/// Important note 4: Regarding parameter types: +/// - If A has components of real type (float or double), then: +/// - alpha must be of real type as well, +/// - components of x must be of real type as well, and +/// - components of y must be of real type as well. +/// - If A has components of complex type (complex or +/// complex), then: +/// - alpha must be of complex type as well (it may have zero +/// imaginary part, no problem), +/// - components of x may be of real type or complex type, and +/// - components of y may be of real type or complex type. +/// +/// \tparam ExecutionSpace The type of execution space +/// \tparam XViewType Input vector, as a 1-D Kokkos::View +/// \tparam YViewType Input vector, as a 1-D Kokkos::View +/// \tparam AViewType Input/Output matrix, as a 2-D Kokkos::View +/// +/// \param space [in] Execution space instance on which to run the kernel. +/// This may contain information about which stream to +/// run on. +/// \param trans [in] "T" or "t" for transpose, "H" or "h" for Hermitian. +/// Only the first character is taken into account. +/// \param uplo [in] "U" or "u" for upper portion, "L" or "l" for lower +/// portion. Only the first character is taken into +/// account. 
+/// \param alpha [in] Input coefficient of x * y^{T,H} +/// \param x [in] Input vector, as a 1-D Kokkos::View +/// \param y [in] Input vector, as a 1-D Kokkos::View +/// \param A [in/out] Output matrix, as a nonconst 2-D Kokkos::View +template +void syr2(const ExecutionSpace& space, const char trans[], const char uplo[], + const typename AViewType::const_value_type& alpha, const XViewType& x, + const YViewType& y, const AViewType& A) { + static_assert( + Kokkos::SpaceAccessibility::accessible, + "AViewType memory space must be accessible from ExecutionSpace"); + static_assert( + Kokkos::SpaceAccessibility::accessible, + "XViewType memory space must be accessible from ExecutionSpace"); + static_assert( + Kokkos::SpaceAccessibility::accessible, + "YViewType memory space must be accessible from ExecutionSpace"); + + static_assert(Kokkos::is_view::value, + "AViewType must be a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "XViewType must be a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "YViewType must be a Kokkos::View."); + + static_assert(static_cast(AViewType::rank()) == 2, + "AViewType must have rank 2."); + static_assert(static_cast(XViewType::rank()) == 1, + "XViewType must have rank 1."); + static_assert(static_cast(YViewType::rank()) == 1, + "YViewType must have rank 1."); + + // Check compatibility of dimensions at run time. + if ((A.extent(0) == A.extent(1)) && (A.extent(0) == x.extent(0)) && + (A.extent(0) == y.extent(0))) { + // Ok + } else { + std::ostringstream os; + os << "KokkosBlas::syr2: Dimensions of A, x, y: " + << "A is " << A.extent(0) << " by " << A.extent(1) << ", x has size " + << x.extent(0) << ", y has size " << y.extent(0); + KokkosKernels::Impl::throw_runtime_exception(os.str()); + } + + if ((trans[0] == 'T') || (trans[0] == 't') || (trans[0] == 'H') || + (trans[0] == 'h')) { + // Ok + } else { + std::ostringstream os; + os << "KokkosBlas2::syr2(): invalid trans[0] = '" << trans[0] + << "'. 
It must be equal to 'T' or 't' or 'H' or 'h'"; + KokkosKernels::Impl::throw_runtime_exception(os.str()); + } + + if ((uplo[0] == 'U') || (uplo[0] == 'u') || (uplo[0] == 'L') || + (uplo[0] == 'l')) { + // Ok + } else { + std::ostringstream oss; + oss << "KokkosBlas2::syr2(): invalid uplo[0] = '" << uplo[0] + << "'. It must be equal to 'U' or 'u' or 'L' or 'l'"; + throw std::runtime_error(oss.str()); + } + + if ((A.extent(0) == 0) || (A.extent(1) == 0)) { + return; + } + + using ALayout = typename AViewType::array_layout; + + // Minimize the number of Impl::SYR2 instantiations, by standardizing + // on particular View specializations for its template parameters. + typedef Kokkos::View::array_layout, + typename XViewType::device_type, + Kokkos::MemoryTraits > + XVT; + + typedef Kokkos::View::array_layout, + typename YViewType::device_type, + Kokkos::MemoryTraits > + YVT; + + typedef Kokkos::View > + AVT; + + Impl::SYR2::syr2(space, trans, uplo, alpha, x, + y, A); +} + +/// \brief Rank-2 update (just lower portion or just upper portion) of a +/// matrix A that is: +/// - symmetric, A += alpha * x * y^T + alpha * y * x^T, or +/// - Hermitian, A += alpha * x * y^H + conj(alpha) * y * x^H. +/// +/// Important note 1: this routine encapsulates the syr2() and her2() +/// routines specified in the BLAS documentation. It has the purpose of +/// updating a symmetric (or Hermitian) matrix A in such a way that +/// it continues to be symmetric (or Hermitian). +/// +/// Important note 2: however, this routine will honor all parameters +/// passed to it, even if A is not symmetric or not Hermitian. +/// Moreover, this routine will always compute either the lower +/// portion or the upper portion (per user's request) of the final +/// matrix A. So, in order to obtain meaningful results, the user +/// must make sure to follow the conditions specified in the +/// "important note 1" above. 
+/// +/// Important note 3: if TPL is enabled, this routine will call the +/// third party library BLAS routines whenever the parameters passed +/// are consistent with the parameters expected by the corresponding +/// TPL routine. If not, then this routine will route the execution +/// to the kokkos-kernels implementation, thus honoring all +/// parameters passed, as stated in the "important note 2" above. +/// +/// Important note 4: Regarding parameter types: +/// - If A has components of real type (float or double), then: +/// - alpha must be of real type as well, +/// - components of x must be of real type as well, and +/// - components of y must be of real type as well. +/// - If A has components of complex type (complex or +/// complex), then: +/// - alpha must be of complex type as well (it may have zero +/// imaginary part, no problem), +/// - components of x may be of real type or complex type, and +/// - components of y may be of real type or complex type. +/// +/// \tparam XViewType Input vector, as a 1-D Kokkos::View +/// \tparam YViewType Input vector, as a 1-D Kokkos::View +/// \tparam AViewType Input/Output matrix, as a 2-D Kokkos::View +/// +/// \param trans [in] "T" or "t" for transpose, "H" or "h" for Hermitian. +/// Only the first character is taken into account. +/// \param uplo [in] "U" or "u" for upper portion, "L" or "l" for lower +/// portion. Only the first character is taken into +/// account. 
+/// \param alpha [in] Input coefficient of x * y^{T,H} +/// \param x [in] Input vector, as a 1-D Kokkos::View +/// \param y [in] Input vector, as a 1-D Kokkos::View +/// \param A [in/out] Output matrix, as a nonconst 2-D Kokkos::View +template +void syr2(const char trans[], const char uplo[], + const typename AViewType::const_value_type& alpha, const XViewType& x, + const YViewType& y, const AViewType& A) { + const typename AViewType::execution_space space = + typename AViewType::execution_space(); + syr2( + space, trans, uplo, alpha, x, y, A); +} + +} // namespace KokkosBlas + +#endif // KOKKOSBLAS2_SYR2_HPP_ diff --git a/blas/tpls/KokkosBlas1_dot_tpl_spec_avail.hpp b/blas/tpls/KokkosBlas1_dot_tpl_spec_avail.hpp index ca2139980d..3ba8f063b4 100644 --- a/blas/tpls/KokkosBlas1_dot_tpl_spec_avail.hpp +++ b/blas/tpls/KokkosBlas1_dot_tpl_spec_avail.hpp @@ -52,18 +52,22 @@ KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_BLAS(double, Kokkos::LayoutLeft, Kokkos::HostSpace) KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_BLAS(float, Kokkos::LayoutLeft, Kokkos::HostSpace) + +// TODO: we had difficulties setting the BLAS library properly in +// FindTPLMKL.cmake, so the test in CheckHostBlasReturnComplex.cmake could not +// be compiled and run to give a correct answer on KK_BLAS_RESULT_AS_POINTER_ARG. +// This resulted in a segfault in dot() with MKL and complex. +// So we just temporarily disable it until FindTPLMKL.cmake is fixed. 
+#if !defined(KOKKOSKERNELS_ENABLE_TPL_MKL) KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, Kokkos::HostSpace) KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, Kokkos::HostSpace) +#endif #endif -// cuBLAS -#ifdef KOKKOSKERNELS_ENABLE_TPL_CUBLAS -// double -#define KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_CUBLAS(SCALAR, LAYOUT, EXECSPACE, \ - MEMSPACE) \ +#define KOKKOSBLAS1_DOT_TPL_SPEC(SCALAR, LAYOUT, EXECSPACE, MEMSPACE) \ template <> \ struct dot_tpl_spec_avail< \ EXECSPACE, \ @@ -77,19 +81,27 @@ KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, enum : bool { value = true }; \ }; -KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_CUBLAS(double, Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace) -KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_CUBLAS(float, Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace) -KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, - Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace) -KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, - Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace) +#define KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL(LAYOUT, EXECSPACE, MEMSPACE) \ + KOKKOSBLAS1_DOT_TPL_SPEC(float, LAYOUT, EXECSPACE, MEMSPACE) \ + KOKKOSBLAS1_DOT_TPL_SPEC(double, LAYOUT, EXECSPACE, MEMSPACE) \ + KOKKOSBLAS1_DOT_TPL_SPEC(Kokkos::complex, LAYOUT, EXECSPACE, \ + MEMSPACE) \ + KOKKOSBLAS1_DOT_TPL_SPEC(Kokkos::complex, LAYOUT, EXECSPACE, MEMSPACE) +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUBLAS +KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL(Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaSpace) +#endif + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCBLAS +KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL(Kokkos::LayoutLeft, Kokkos::HIP, + Kokkos::HIPSpace) #endif +#if defined(KOKKOSKERNELS_ENABLE_TPL_MKL) && defined(KOKKOS_ENABLE_SYCL) +KOKKOSBLAS1_DOT_TPL_SPEC_AVAIL(Kokkos::LayoutLeft, Kokkos::Experimental::SYCL, + Kokkos::Experimental::SYCLDeviceUSMSpace) +#endif } // namespace Impl } // namespace KokkosBlas #endif diff --git 
a/blas/tpls/KokkosBlas1_dot_tpl_spec_decl.hpp b/blas/tpls/KokkosBlas1_dot_tpl_spec_decl.hpp index 718e32f14c..ace26ebdbd 100644 --- a/blas/tpls/KokkosBlas1_dot_tpl_spec_decl.hpp +++ b/blas/tpls/KokkosBlas1_dot_tpl_spec_decl.hpp @@ -39,71 +39,40 @@ inline void dot_print_specialization() { namespace KokkosBlas { namespace Impl { - -#define KOKKOSBLAS1_DDOT_TPL_SPEC_DECL_BLAS(LAYOUT, MEMSPACE, ETI_SPEC_AVAIL) \ +#define KOKKOSBLAS1_DOT_TPL_SPEC_DECL_BLAS(LAYOUT, KOKKOS_TYPE, TPL_TYPE, \ + MEMSPACE, ETI_SPEC_AVAIL) \ template \ - struct Dot< \ - ExecSpace, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void dot(const ExecSpace& space, RV& R, const XV& X, const XV& Y) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::dot[TPL_BLAS,double]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - dot_print_specialization(); \ - int N = numElems; \ - int one = 1; \ - R() = HostBlas::dot(N, X.data(), one, Y.data(), one); \ - } else { \ - Dot::dot(space, R, \ - X, Y); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -#define KOKKOSBLAS1_SDOT_TPL_SPEC_DECL_BLAS(LAYOUT, MEMSPACE, ETI_SPEC_AVAIL) \ - template \ - struct Dot< \ - ExecSpace, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + 1, 1, true, ETI_SPEC_AVAIL> { \ + typedef Kokkos::View > \ RV; \ - typedef Kokkos::View, \ Kokkos::MemoryTraits > \ XV; \ typedef typename XV::size_type size_type; \ \ static void dot(const ExecSpace& space, RV& R, const XV& X, const XV& Y) { \ - 
Kokkos::Profiling::pushRegion("KokkosBlas::dot[TPL_BLAS,float]"); \ + Kokkos::Profiling::pushRegion("KokkosBlas::dot[TPL_BLAS," + \ + Kokkos::ArithTraits::name() + \ + "]"); \ const size_type numElems = X.extent(0); \ if (numElems < static_cast(INT_MAX)) { \ dot_print_specialization(); \ int N = numElems; \ int one = 1; \ - R() = HostBlas::dot(N, X.data(), one, Y.data(), one); \ + R() = HostBlas::dot( \ + N, reinterpret_cast(X.data()), one, \ + reinterpret_cast(Y.data()), one); \ } else { \ Dot::dot(space, R, \ X, Y); \ @@ -112,105 +81,22 @@ namespace Impl { } \ }; -#define KOKKOSBLAS1_ZDOT_TPL_SPEC_DECL_BLAS(LAYOUT, MEMSPACE, ETI_SPEC_AVAIL) \ - template \ - struct Dot, LAYOUT, Kokkos::HostSpace, \ - Kokkos::MemoryTraits >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - 1, 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View, LAYOUT, Kokkos::HostSpace, \ - Kokkos::MemoryTraits > \ - RV; \ - typedef Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void dot(const ExecSpace& space, RV& R, const XV& X, const XV& Y) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::dot[TPL_BLAS,complex]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - dot_print_specialization(); \ - int N = numElems; \ - int one = 1; \ - R() = HostBlas >::dot( \ - N, reinterpret_cast*>(X.data()), one, \ - reinterpret_cast*>(Y.data()), one); \ - } else { \ - Dot::dot(space, R, \ - X, Y); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -#define KOKKOSBLAS1_CDOT_TPL_SPEC_DECL_BLAS(LAYOUT, MEMSPACE, ETI_SPEC_AVAIL) \ - template \ - struct Dot, LAYOUT, Kokkos::HostSpace, \ - Kokkos::MemoryTraits >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - 1, 1, 
true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View, LAYOUT, Kokkos::HostSpace, \ - Kokkos::MemoryTraits > \ - RV; \ - typedef Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void dot(const ExecSpace& space, RV& R, const XV& X, const XV& Y) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::dot[TPL_BLAS,complex]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - dot_print_specialization(); \ - int N = numElems; \ - int one = 1; \ - R() = HostBlas >::dot( \ - N, reinterpret_cast*>(X.data()), one, \ - reinterpret_cast*>(Y.data()), one); \ - } else { \ - Dot::dot(space, R, \ - X, Y); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -KOKKOSBLAS1_DDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, true) -KOKKOSBLAS1_DDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - false) - -KOKKOSBLAS1_SDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, true) -KOKKOSBLAS1_SDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - false) - -KOKKOSBLAS1_ZDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, true) -KOKKOSBLAS1_ZDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - false) - -KOKKOSBLAS1_CDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, true) -KOKKOSBLAS1_CDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - false) +#define KOKKOSBLAS1_DOT_TPL_SPEC_DECL_BLAS_EXT(ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, float, float, \ + Kokkos::HostSpace, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, double, double, \ + Kokkos::HostSpace, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_BLAS( \ + Kokkos::LayoutLeft, Kokkos::complex, std::complex, \ + Kokkos::HostSpace, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_BLAS( \ + Kokkos::LayoutLeft, Kokkos::complex, std::complex, \ + Kokkos::HostSpace, 
ETI_SPEC_AVAIL) +KOKKOSBLAS1_DOT_TPL_SPEC_DECL_BLAS_EXT(true) +KOKKOSBLAS1_DOT_TPL_SPEC_DECL_BLAS_EXT(false) } // namespace Impl } // namespace KokkosBlas - #endif // cuBLAS @@ -219,38 +105,48 @@ KOKKOSBLAS1_CDOT_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, namespace KokkosBlas { namespace Impl { - -#define KOKKOSBLAS1_DDOT_TPL_SPEC_DECL_CUBLAS(LAYOUT, EXECSPACE, MEMSPACE, \ - ETI_SPEC_AVAIL) \ +#define KOKKOSBLAS1_DOT_TPL_SPEC_DECL_CUBLAS(LAYOUT, KOKKOS_TYPE, TPL_TYPE, \ + EXECSPACE, MEMSPACE, TPL_DOT, \ + ETI_SPEC_AVAIL) \ template <> \ - struct Dot< \ - EXECSPACE, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + 1, 1, true, ETI_SPEC_AVAIL> { \ + typedef Kokkos::View > \ RV; \ - typedef Kokkos::View, \ Kokkos::MemoryTraits > \ XV; \ typedef typename XV::size_type size_type; \ \ static void dot(const EXECSPACE& space, RV& R, const XV& X, const XV& Y) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::dot[TPL_CUBLAS,double]"); \ + Kokkos::Profiling::pushRegion("KokkosBlas::dot[TPL_CUBLAS," + \ + Kokkos::ArithTraits::name() + \ + "]"); \ const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ + /* TODO: CUDA-12's 64-bit indices allow larger numElems */ \ + if (numElems <= \ + static_cast(std::numeric_limits::max())) { \ dot_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ + const int N = static_cast(numElems); \ KokkosBlas::Impl::CudaBlasSingleton& s = \ KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ - cublasDdot(s.handle, N, X.data(), one, Y.data(), one, &R()); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasSetStream(s.handle, space.cuda_stream())); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + TPL_DOT(s.handle, N, reinterpret_cast(X.data()), \ + 1, 
reinterpret_cast(Y.data()), 1, \ + reinterpret_cast(&R()))); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ } else { \ Dot::dot(space, R, \ X, Y); \ @@ -259,81 +155,73 @@ namespace Impl { } \ }; -#define KOKKOSBLAS1_SDOT_TPL_SPEC_DECL_CUBLAS(LAYOUT, EXECSPACE, MEMSPACE, \ - ETI_SPEC_AVAIL) \ - template <> \ - struct Dot< \ - EXECSPACE, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void dot(const EXECSPACE& space, RV& R, const XV& X, const XV& Y) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::dot[TPL_CUBLAS,float]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - dot_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::CudaBlasSingleton& s = \ - KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ - cublasSdot(s.handle, N, X.data(), one, Y.data(), one, &R()); \ - } else { \ - Dot::dot(space, R, \ - X, Y); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; +#define KOKKOSBLAS1_DOT_TPL_SPEC_DECL_CUBLAS_EXT(ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, float, float, \ + Kokkos::Cuda, Kokkos::CudaSpace, \ + cublasSdot, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, double, double, \ + Kokkos::Cuda, Kokkos::CudaSpace, \ + cublasDdot, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_CUBLAS( \ + Kokkos::LayoutLeft, Kokkos::complex, cuComplex, Kokkos::Cuda, \ + Kokkos::CudaSpace, cublasCdotc, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_CUBLAS( \ + Kokkos::LayoutLeft, Kokkos::complex, cuDoubleComplex, \ + Kokkos::Cuda, Kokkos::CudaSpace, cublasZdotc, ETI_SPEC_AVAIL) + +KOKKOSBLAS1_DOT_TPL_SPEC_DECL_CUBLAS_EXT(true) 
+KOKKOSBLAS1_DOT_TPL_SPEC_DECL_CUBLAS_EXT(false) +} // namespace Impl +} // namespace KokkosBlas +#endif + +// rocBLAS +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCBLAS +#include -#define KOKKOSBLAS1_ZDOT_TPL_SPEC_DECL_CUBLAS(LAYOUT, EXECSPACE, MEMSPACE, \ +namespace KokkosBlas { +namespace Impl { +#define KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ROCBLAS(LAYOUT, KOKKOS_TYPE, TPL_TYPE, \ + EXECSPACE, MEMSPACE, TPL_DOT, \ ETI_SPEC_AVAIL) \ template <> \ struct Dot, LAYOUT, Kokkos::HostSpace, \ + Kokkos::View >, \ - Kokkos::View*, LAYOUT, \ + Kokkos::View, \ Kokkos::MemoryTraits >, \ - Kokkos::View*, LAYOUT, \ + Kokkos::View, \ Kokkos::MemoryTraits >, \ 1, 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View, LAYOUT, Kokkos::HostSpace, \ + typedef Kokkos::View > \ RV; \ - typedef Kokkos::View*, LAYOUT, \ + typedef Kokkos::View, \ Kokkos::MemoryTraits > \ XV; \ typedef typename XV::size_type size_type; \ \ static void dot(const EXECSPACE& space, RV& R, const XV& X, const XV& Y) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::dot[TPL_CUBLAS,complex]"); \ + Kokkos::Profiling::pushRegion("KokkosBlas::dot[TPL_ROCBLAS," + \ + Kokkos::ArithTraits::name() + \ + "]"); \ const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ + if (numElems <= \ + static_cast(std::numeric_limits::max())) { \ dot_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::CudaBlasSingleton& s = \ - KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ - cublasZdotc(s.handle, N, \ - reinterpret_cast(X.data()), one, \ - reinterpret_cast(Y.data()), one, \ - reinterpret_cast(&R())); \ + const rocblas_int N = static_cast(numElems); \ + KokkosBlas::Impl::RocBlasSingleton& s = \ + KokkosBlas::Impl::RocBlasSingleton::singleton(); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + rocblas_set_stream(s.handle, space.hip_stream())); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + TPL_DOT(s.handle, N, reinterpret_cast(X.data()), \ + 1, reinterpret_cast(Y.data()), 
1, \ + reinterpret_cast(&R()))); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ } else { \ Dot::dot(space, R, \ X, Y); \ @@ -342,72 +230,100 @@ namespace Impl { } \ }; -#define KOKKOSBLAS1_CDOT_TPL_SPEC_DECL_CUBLAS(LAYOUT, EXECSPACE, MEMSPACE, \ - ETI_SPEC_AVAIL) \ +#define KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ROCBLAS_EXT(ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, float, float, \ + Kokkos::HIP, Kokkos::HIPSpace, \ + rocblas_sdot, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, double, double, \ + Kokkos::HIP, Kokkos::HIPSpace, \ + rocblas_ddot, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ROCBLAS( \ + Kokkos::LayoutLeft, Kokkos::complex, rocblas_float_complex, \ + Kokkos::HIP, Kokkos::HIPSpace, rocblas_cdotc, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ROCBLAS( \ + Kokkos::LayoutLeft, Kokkos::complex, rocblas_double_complex, \ + Kokkos::HIP, Kokkos::HIPSpace, rocblas_zdotc, ETI_SPEC_AVAIL) + +KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ROCBLAS_EXT(true) +KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ROCBLAS_EXT(false) +} // namespace Impl +} // namespace KokkosBlas +#endif + +// ONEMKL +#if defined(KOKKOSKERNELS_ENABLE_TPL_MKL) && defined(KOKKOS_ENABLE_SYCL) +#include +#include +#include + +namespace KokkosBlas { +namespace Impl { +#define KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ONEMKL(LAYOUT, KOKKOS_TYPE, TPL_TYPE, \ + EXECSPACE, MEMSPACE, TPL_DOT, \ + ETI_SPEC_AVAIL) \ template <> \ struct Dot, LAYOUT, Kokkos::HostSpace, \ + Kokkos::View >, \ - Kokkos::View*, LAYOUT, \ + Kokkos::View, \ Kokkos::MemoryTraits >, \ - Kokkos::View*, LAYOUT, \ + Kokkos::View, \ Kokkos::MemoryTraits >, \ 1, 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View, LAYOUT, Kokkos::HostSpace, \ + typedef Kokkos::View > \ RV; \ - typedef Kokkos::View*, LAYOUT, \ + typedef Kokkos::View, \ Kokkos::MemoryTraits > \ XV; \ typedef typename XV::size_type size_type; \ \ - static void dot(const EXECSPACE& space, RV& R, const XV& X, const XV& Y) { 
\ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::dot[TPL_CUBLAS,complex]"); \ + static void dot(const EXECSPACE& exec, RV& R, const XV& X, const XV& Y) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::dot[TPL_ONEMKL," + \ + Kokkos::ArithTraits::name() + \ + "]"); \ const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ + if (numElems <= \ + static_cast(std::numeric_limits::max())) { \ dot_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::CudaBlasSingleton& s = \ - KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ - cublasCdotc(s.handle, N, reinterpret_cast(X.data()), \ - one, reinterpret_cast(Y.data()), one, \ - reinterpret_cast(&R())); \ + const std::int64_t N = static_cast(numElems); \ + TPL_DOT(exec.sycl_queue(), N, \ + reinterpret_cast(X.data()), 1, \ + reinterpret_cast(Y.data()), 1, \ + reinterpret_cast(&R())); \ } else { \ - Dot::dot(space, R, \ + Dot::dot(exec, R, \ X, Y); \ } \ Kokkos::Profiling::popRegion(); \ } \ }; -KOKKOSBLAS1_DDOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, true) -KOKKOSBLAS1_DDOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, false) - -KOKKOSBLAS1_SDOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, true) -KOKKOSBLAS1_SDOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, false) - -KOKKOSBLAS1_ZDOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, true) -KOKKOSBLAS1_ZDOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, false) - -KOKKOSBLAS1_CDOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, true) -KOKKOSBLAS1_CDOT_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, false) +#define KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ONEMKL_EXT(ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ONEMKL( \ + Kokkos::LayoutLeft, float, float, 
Kokkos::Experimental::SYCL, \ + Kokkos::Experimental::SYCLDeviceUSMSpace, \ + oneapi::mkl::blas::row_major::dot, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ONEMKL( \ + Kokkos::LayoutLeft, double, double, Kokkos::Experimental::SYCL, \ + Kokkos::Experimental::SYCLDeviceUSMSpace, \ + oneapi::mkl::blas::row_major::dot, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ONEMKL( \ + Kokkos::LayoutLeft, Kokkos::complex, std::complex, \ + Kokkos::Experimental::SYCL, Kokkos::Experimental::SYCLDeviceUSMSpace, \ + oneapi::mkl::blas::row_major::dotc, ETI_SPEC_AVAIL) \ + KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ONEMKL( \ + Kokkos::LayoutLeft, Kokkos::complex, std::complex, \ + Kokkos::Experimental::SYCL, Kokkos::Experimental::SYCLDeviceUSMSpace, \ + oneapi::mkl::blas::row_major::dotc, ETI_SPEC_AVAIL) +KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ONEMKL_EXT(true) +KOKKOSBLAS1_DOT_TPL_SPEC_DECL_ONEMKL_EXT(false) } // namespace Impl } // namespace KokkosBlas - #endif #endif diff --git a/blas/tpls/KokkosBlas1_nrm1_tpl_spec_avail.hpp b/blas/tpls/KokkosBlas1_nrm1_tpl_spec_avail.hpp index 04ec811990..be0a45c7be 100644 --- a/blas/tpls/KokkosBlas1_nrm1_tpl_spec_avail.hpp +++ b/blas/tpls/KokkosBlas1_nrm1_tpl_spec_avail.hpp @@ -113,6 +113,40 @@ KOKKOSBLAS1_NRM1_TPL_SPEC_AVAIL_ROCBLAS(Kokkos::complex, #endif // KOKKOSKERNELS_ENABLE_TPL_ROCBLAS +// oneMKL +#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL + +#if defined(KOKKOS_ENABLE_SYCL) && \ + !defined(KOKKOSKERNELS_ENABLE_TPL_MKL_SYCL_OVERRIDE) + +#define KOKKOSBLAS1_NRM1_TPL_SPEC_AVAIL_MKL_SYCL(SCALAR, LAYOUT, MEMSPACE) \ + template \ + struct nrm1_tpl_spec_avail< \ + ExecSpace, \ + Kokkos::View< \ + typename Kokkos::Details::InnerProductSpaceTraits::mag_type, \ + LAYOUT, Kokkos::HostSpace, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + 1> { \ + enum : bool { value = true }; \ + }; + +KOKKOSBLAS1_NRM1_TPL_SPEC_AVAIL_MKL_SYCL( + double, Kokkos::LayoutLeft, Kokkos::Experimental::SYCLDeviceUSMSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_AVAIL_MKL_SYCL( 
+ float, Kokkos::LayoutLeft, Kokkos::Experimental::SYCLDeviceUSMSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_AVAIL_MKL_SYCL( + Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLDeviceUSMSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_AVAIL_MKL_SYCL( + Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLDeviceUSMSpace) + +#endif // KOKKOS_ENABLE_SYCL +#endif // KOKKOSKERNELS_ENABLE_TPL_MKL + } // namespace Impl } // namespace KokkosBlas #endif diff --git a/blas/tpls/KokkosBlas1_nrm1_tpl_spec_decl.hpp b/blas/tpls/KokkosBlas1_nrm1_tpl_spec_decl.hpp index b5b6e061ec..c695eaee1e 100644 --- a/blas/tpls/KokkosBlas1_nrm1_tpl_spec_decl.hpp +++ b/blas/tpls/KokkosBlas1_nrm1_tpl_spec_decl.hpp @@ -39,161 +39,88 @@ inline void nrm1_print_specialization() { namespace KokkosBlas { namespace Impl { -#define KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_BLAS(LAYOUT, MEMSPACE, ETI_SPEC_AVAIL) \ - template \ - struct Nrm1< \ - ExecSpace, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const ExecSpace& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_BLAS,double]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - int N = numElems; \ - int one = 1; \ - R() = HostBlas::asum(N, X.data(), one); \ - } else { \ - Nrm1::nrm1(space, R, X); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -#define KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_BLAS(LAYOUT, MEMSPACE, ETI_SPEC_AVAIL) \ - template \ +#define KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(SCALAR, LAYOUT, EXECSPACE, \ + MEMSPACE) \ + template <> \ struct Nrm1< \ - ExecSpace, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View, \ - 
Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ + EXECSPACE, \ + Kokkos::View::mag_type, LAYOUT, \ + Kokkos::HostSpace, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + 1, true, \ + nrm1_eti_spec_avail< \ + EXECSPACE, \ + Kokkos::View::mag_type, LAYOUT, \ + Kokkos::HostSpace, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using mag_type = typename Kokkos::ArithTraits::mag_type; \ + using RV = Kokkos::View>; \ + using XV = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using size_type = typename XV::size_type; \ \ - static void nrm1(const ExecSpace& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_BLAS,float]"); \ + static void nrm1(const EXECSPACE& space, RV& R, const XV& X) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_BLAS," #SCALAR "]"); \ const size_type numElems = X.extent(0); \ if (numElems < static_cast(INT_MAX)) { \ nrm1_print_specialization(); \ int N = numElems; \ int one = 1; \ - R() = HostBlas::asum(N, X.data(), one); \ + if constexpr (Kokkos::ArithTraits::is_complex) { \ + R() = HostBlas>::asum( \ + N, reinterpret_cast*>(X.data()), \ + one); \ + } else { \ + R() = HostBlas::asum(N, X.data(), one); \ + } \ } else { \ - Nrm1::nrm1(space, R, X); \ + Nrm1::value>::nrm1(space, R, \ + X); \ } \ Kokkos::Profiling::popRegion(); \ } \ }; -#define KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_BLAS(LAYOUT, MEMSPACE, ETI_SPEC_AVAIL) \ - template \ - struct Nrm1 >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const ExecSpace& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::nrm1[TPL_BLAS,complex]"); \ - const size_type numElems = X.extent(0); \ - if 
(numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - int N = numElems; \ - int one = 1; \ - R() = HostBlas >::asum( \ - N, reinterpret_cast*>(X.data()), one); \ - } else { \ - Nrm1::nrm1(space, R, X); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -#define KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_BLAS(LAYOUT, MEMSPACE, ETI_SPEC_AVAIL) \ - template \ - struct Nrm1 >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const ExecSpace& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::nrm1[TPL_BLAS,complex]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - int N = numElems; \ - int one = 1; \ - R() = HostBlas >::asum( \ - N, reinterpret_cast*>(X.data()), one); \ - } else { \ - Nrm1::nrm1(space, R, X); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - true) -KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - false) - -KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - true) -KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - false) +#if defined(KOKKOS_ENABLE_SERIAL) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(float, Kokkos::LayoutLeft, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(double, Kokkos::LayoutLeft, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Serial, Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Serial, Kokkos::HostSpace) +#endif 
-KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - true) -KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - false) +#if defined(KOKKOS_ENABLE_OPENMP) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(float, Kokkos::LayoutLeft, Kokkos::OpenMP, + Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(double, Kokkos::LayoutLeft, Kokkos::OpenMP, + Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::OpenMP, Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::OpenMP, Kokkos::HostSpace) +#endif -KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - true) -KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, - false) +#if defined(KOKKOS_ENABLE_THREADS) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(float, Kokkos::LayoutLeft, Kokkos::Threads, + Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(double, Kokkos::LayoutLeft, Kokkos::Threads, + Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Threads, Kokkos::HostSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Threads, Kokkos::HostSpace) +#endif } // namespace Impl } // namespace KokkosBlas @@ -207,202 +134,105 @@ KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_BLAS(Kokkos::LayoutLeft, Kokkos::HostSpace, namespace KokkosBlas { namespace Impl { -#define KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_CUBLAS(LAYOUT, EXECSPACE, MEMSPACE, \ - ETI_SPEC_AVAIL) \ +template +void cublasAsumWrapper(const ExecutionSpace& space, RViewType& R, + const XViewType& X) { + using XScalar = typename XViewType::non_const_value_type; + + nrm1_print_specialization(); + const int N = static_cast(X.extent(0)); + constexpr int one = 1; + KokkosBlas::Impl::CudaBlasSingleton& s = + KokkosBlas::Impl::CudaBlasSingleton::singleton(); + + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, 
space.cuda_stream())); + if constexpr (std::is_same_v) { + KOKKOS_CUBLAS_SAFE_CALL_IMPL( + cublasSasum(s.handle, N, X.data(), one, R.data())); + } + if constexpr (std::is_same_v) { + KOKKOS_CUBLAS_SAFE_CALL_IMPL( + cublasDasum(s.handle, N, X.data(), one, R.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_CUBLAS_SAFE_CALL_IMPL( + cublasScasum(s.handle, N, reinterpret_cast(X.data()), + one, R.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasDzasum( + s.handle, N, reinterpret_cast(X.data()), one, + R.data())); + } + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); +} + +#define KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(SCALAR, LAYOUT, MEMSPACE) \ template <> \ struct Nrm1< \ - EXECSPACE, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - using execution_space = EXECSPACE; \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ + Kokkos::Cuda, \ + Kokkos::View::mag_type, LAYOUT, \ + Kokkos::HostSpace, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + 1, true, \ + nrm1_eti_spec_avail< \ + Kokkos::Cuda, \ + Kokkos::View::mag_type, LAYOUT, \ + Kokkos::HostSpace, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using execution_space = Kokkos::Cuda; \ + using RV = Kokkos::View::mag_type, \ + LAYOUT, Kokkos::HostSpace, \ + Kokkos::MemoryTraits>; \ + using XV = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using size_type = typename XV::size_type; \ \ static void nrm1(const execution_space& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_CUBLAS,double]"); \ + Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_CUBLAS," #SCALAR \ + "]"); \ const size_type numElems = X.extent(0); \ if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - const int N = 
static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::CudaBlasSingleton& s = \ - KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ - cublasSetStream(s.handle, space.cuda_stream())); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ - cublasDasum(s.handle, N, X.data(), one, R.data())); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ + cublasAsumWrapper(space, R, X); \ } else { \ - Nrm1::nrm1(space, \ - R, X); \ + Nrm1::value>::nrm1(space, R, \ + X); \ } \ Kokkos::Profiling::popRegion(); \ } \ }; -#define KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_CUBLAS(LAYOUT, EXECSPACE, MEMSPACE, \ - ETI_SPEC_AVAIL) \ - template <> \ - struct Nrm1< \ - EXECSPACE, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - using execution_space = EXECSPACE; \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const execution_space& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_CUBLAS,float]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::CudaBlasSingleton& s = \ - KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ - cublasSetStream(s.handle, space.cuda_stream())); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ - cublasSasum(s.handle, N, X.data(), one, R.data())); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ - } else { \ - Nrm1::nrm1(space, \ - R, X); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -#define KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_CUBLAS(LAYOUT, EXECSPACE, MEMSPACE, \ - ETI_SPEC_AVAIL) \ - template <> \ - struct Nrm1 >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - 1, true, 
ETI_SPEC_AVAIL> { \ - using execution_space = EXECSPACE; \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const execution_space& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::nrm1[TPL_CUBLAS,complex]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::CudaBlasSingleton& s = \ - KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ - cublasSetStream(s.handle, space.cuda_stream())); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasDzasum( \ - s.handle, N, reinterpret_cast(X.data()), \ - one, R.data())); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ - } else { \ - Nrm1::nrm1(space, \ - R, X); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -#define KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_CUBLAS(LAYOUT, EXECSPACE, MEMSPACE, \ - ETI_SPEC_AVAIL) \ - template <> \ - struct Nrm1 >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - using execution_space = EXECSPACE; \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const execution_space& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::nrm1[TPL_CUBLAS,complex]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::CudaBlasSingleton& s = \ - KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ - cublasSetStream(s.handle, 
space.cuda_stream())); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasScasum( \ - s.handle, N, reinterpret_cast(X.data()), one, \ - R.data())); \ - KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ - } else { \ - Nrm1::nrm1(space, \ - R, X); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, true) -KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, false) - -KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, true) -KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, false) - -KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, true) -KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, false) - -KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, true) -KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, - Kokkos::CudaSpace, false) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(float, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(double, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::CudaSpace) + +#if defined(KOKKOSKERNELS_INST_MEMSPACE_CUDAUVMSPACE) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(float, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(double, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::CudaUVMSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::CudaUVMSpace) +#endif } // namespace Impl } // namespace KokkosBlas - -#endif 
+#endif // KOKKOSKERNELS_ENABLE_TPL_CUBLAS // rocBLAS #ifdef KOKKOSKERNELS_ENABLE_TPL_ROCBLAS @@ -411,195 +241,218 @@ KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, namespace KokkosBlas { namespace Impl { -#define KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_ROCBLAS(LAYOUT, MEMSPACE, \ - ETI_SPEC_AVAIL) \ - template \ +template +void rocblasAsumWrapper(const ExecutionSpace& space, RViewType& R, + const XViewType& X) { + using XScalar = typename XViewType::non_const_value_type; + + nrm1_print_specialization(); + const int N = static_cast(X.extent(0)); + constexpr int one = 1; + KokkosBlas::Impl::RocBlasSingleton& s = + KokkosBlas::Impl::RocBlasSingleton::singleton(); + + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( + rocblas_set_stream(s.handle, space.hip_stream())); + if constexpr (std::is_same_v) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( + rocblas_sasum(s.handle, N, X.data(), one, R.data())); + } + if constexpr (std::is_same_v) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( + rocblas_dasum(s.handle, N, X.data(), one, R.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_scasum( + s.handle, N, reinterpret_cast(X.data()), + one, R.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_dzasum( + s.handle, N, reinterpret_cast(X.data()), + one, R.data())); + } + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); +} + +#define KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_ROCBLAS(SCALAR, LAYOUT, MEMSPACE) \ + template <> \ struct Nrm1< \ - ExecSpace, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ + Kokkos::HIP, \ + Kokkos::View::mag_type, LAYOUT, \ + Kokkos::HostSpace, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + 1, true, \ + nrm1_eti_spec_avail< \ + Kokkos::HIP, \ + Kokkos::View::mag_type, LAYOUT, 
\ + Kokkos::HostSpace, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using RV = Kokkos::View::mag_type, \ + LAYOUT, Kokkos::HostSpace, \ + Kokkos::MemoryTraits>; \ + using XV = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using size_type = typename XV::size_type; \ \ - static void nrm1(const ExecSpace& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_ROCBLAS,double]"); \ + static void nrm1(const Kokkos::HIP& space, RV& R, const XV& X) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_ROCBLAS," #SCALAR \ + "]"); \ const size_type numElems = X.extent(0); \ if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::RocBlasSingleton& s = \ - KokkosBlas::Impl::RocBlasSingleton::singleton(); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ - rocblas_set_stream(s.handle, space.hip_stream())); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ - rocblas_dasum(s.handle, N, X.data(), one, R.data())); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ + rocblasAsumWrapper(space, R, X); \ } else { \ - Nrm1::nrm1(space, R, X); \ + Nrm1::value>::nrm1(space, R, \ + X); \ } \ Kokkos::Profiling::popRegion(); \ } \ }; -#define KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_ROCBLAS(LAYOUT, MEMSPACE, \ - ETI_SPEC_AVAIL) \ - template \ - struct Nrm1< \ - ExecSpace, \ - Kokkos::View >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const ExecSpace& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_ROCBLAS,float]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one 
= 1; \ - KokkosBlas::Impl::RocBlasSingleton& s = \ - KokkosBlas::Impl::RocBlasSingleton::singleton(); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ - rocblas_set_stream(s.handle, space.hip_stream())); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ - rocblas_sasum(s.handle, N, X.data(), one, R.data())); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ - } else { \ - Nrm1::nrm1(space, R, X); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_ROCBLAS(float, Kokkos::LayoutLeft, + Kokkos::HIPSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_ROCBLAS(double, Kokkos::LayoutLeft, + Kokkos::HIPSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::HIPSpace) +KOKKOSBLAS1_NRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::HIPSpace) -#define KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_ROCBLAS(LAYOUT, MEMSPACE, \ - ETI_SPEC_AVAIL) \ - template \ - struct Nrm1 >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const ExecSpace& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::nrm1[TPL_ROCBLAS,complex]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::RocBlasSingleton& s = \ - KokkosBlas::Impl::RocBlasSingleton::singleton(); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ - rocblas_set_stream(s.handle, space.hip_stream())); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_dzasum( \ - s.handle, N, \ - reinterpret_cast(X.data()), one, \ - R.data())); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ - } else { \ - Nrm1::nrm1(space, R, X); \ - } \ - 
Kokkos::Profiling::popRegion(); \ - } \ - }; +} // namespace Impl +} // namespace KokkosBlas -#define KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_ROCBLAS(LAYOUT, MEMSPACE, \ - ETI_SPEC_AVAIL) \ - template \ - struct Nrm1 >, \ - Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - 1, true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::View > \ - RV; \ - typedef Kokkos::View*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - XV; \ - typedef typename XV::size_type size_type; \ - \ - static void nrm1(const ExecSpace& space, RV& R, const XV& X) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosBlas::nrm1[TPL_ROCBLAS,complex]"); \ - const size_type numElems = X.extent(0); \ - if (numElems < static_cast(INT_MAX)) { \ - nrm1_print_specialization(); \ - const int N = static_cast(numElems); \ - constexpr int one = 1; \ - KokkosBlas::Impl::RocBlasSingleton& s = \ - KokkosBlas::Impl::RocBlasSingleton::singleton(); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ - rocblas_set_stream(s.handle, space.hip_stream())); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_scasum( \ - s.handle, N, \ - reinterpret_cast(X.data()), one, \ - R.data())); \ - KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ - } else { \ - Nrm1::nrm1(space, R, X); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; +#endif // KOKKOSKERNELS_ENABLE_TPL_ROCBLAS + +// oneMKL +#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL + +#if defined(KOKKOS_ENABLE_SYCL) && \ + !defined(KOKKOSKERNELS_ENABLE_TPL_MKL_SYCL_OVERRIDE) + +#include +#include -KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIPSpace, - true) -KOKKOSBLAS1_DNRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIPSpace, - false) +namespace KokkosBlas { +namespace Impl { -KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIPSpace, - true) -KOKKOSBLAS1_SNRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIPSpace, - false) +template +void onemklAsumWrapper(const ExecutionSpace& space, RViewType& R, + const 
XViewType& X) { + using XScalar = typename XViewType::non_const_value_type; + using KAT_X = Kokkos::ArithTraits; + using layout_t = typename XViewType::array_layout; + + const std::int64_t N = static_cast(X.extent(0)); + + // Create temp view on device to store the result + Kokkos::View::mag_type, + typename XViewType::memory_space> + res("sycl asum result"); + + // Decide to call row_major or column_major function + if constexpr (std::is_same_v) { + if constexpr (KAT_X::is_complex) { + oneapi::mkl::blas::row_major::asum( + space.sycl_queue(), N, + reinterpret_cast*>( + X.data()), + 1, res.data()); + } else { + oneapi::mkl::blas::row_major::asum(space.sycl_queue(), N, X.data(), 1, + res.data()); + } + } else { + if constexpr (KAT_X::is_complex) { + oneapi::mkl::blas::column_major::asum( + space.sycl_queue(), N, + reinterpret_cast*>( + X.data()), + 1, res.data()); + } else { + oneapi::mkl::blas::column_major::asum(space.sycl_queue(), X.extent_int(0), + X.data(), 1, res.data()); + } + } + // Bring result back to host + Kokkos::deep_copy(space, R, res); +} -KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIPSpace, - true) -KOKKOSBLAS1_ZNRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIPSpace, - false) +#define KOKKOSBLAS1_NRM1_ONEMKL(SCALAR, LAYOUT, MEMSPACE) \ + template <> \ + struct Nrm1< \ + Kokkos::Experimental::SYCL, \ + Kokkos::View::mag_type, LAYOUT, \ + Kokkos::HostSpace, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + 1, true, \ + nrm1_eti_spec_avail< \ + Kokkos::Experimental::SYCL, \ + Kokkos::View::mag_type, LAYOUT, \ + Kokkos::HostSpace, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using execution_space = Kokkos::Experimental::SYCL; \ + using RV = Kokkos::View::mag_type, \ + LAYOUT, Kokkos::HostSpace, \ + Kokkos::MemoryTraits>; \ + using XV = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using size_type = typename XV::size_type; \ + \ + static void 
nrm1(const execution_space& space, RV& R, const XV& X) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::nrm1[TPL_ONEMKL," #SCALAR \ + "]"); \ + const size_type numElems = X.extent(0); \ + if (numElems < static_cast(INT_MAX)) { \ + onemklAsumWrapper(space, R, X); \ + } else { \ + Nrm1::value>::nrm1(space, R, X); \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; -KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIPSpace, - true) -KOKKOSBLAS1_CNRM1_TPL_SPEC_DECL_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIPSpace, - false) +KOKKOSBLAS1_NRM1_ONEMKL(float, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLDeviceUSMSpace) +KOKKOSBLAS1_NRM1_ONEMKL(double, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLDeviceUSMSpace) +KOKKOSBLAS1_NRM1_ONEMKL(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLDeviceUSMSpace) +KOKKOSBLAS1_NRM1_ONEMKL(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLDeviceUSMSpace) + +#if defined(KOKKOSKERNELS_INST_MEMSPACE_SYCLSHAREDSPACE) +KOKKOSBLAS1_NRM1_ONEMKL(float, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLSharedUSMSpace) +KOKKOSBLAS1_NRM1_ONEMKL(double, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLSharedUSMSpace) +KOKKOSBLAS1_NRM1_ONEMKL(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLSharedUSMSpace) +KOKKOSBLAS1_NRM1_ONEMKL(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Experimental::SYCLSharedUSMSpace) +#endif } // namespace Impl } // namespace KokkosBlas -#endif +#endif // KOKKOS_ENABLE_SYCL +#endif // KOKKOSKERNELS_ENABLE_TPL_MKL #endif diff --git a/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_blas.hpp b/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_blas.hpp index 3ba437a5a7..bc1a10f61e 100644 --- a/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_blas.hpp +++ b/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_blas.hpp @@ -149,9 +149,7 @@ namespace Impl { Kokkos::MemoryTraits> \ AViewType; \ \ - static void ger(const EXEC_SPACE& /* space */ \ - , \ - const char trans[], \ + static void ger(const 
EXEC_SPACE& space, const char trans[], \ typename AViewType::const_value_type& alpha, \ const XViewType& X, const YViewType& Y, \ const AViewType& A) { \ @@ -183,8 +181,9 @@ namespace Impl { reinterpret_cast*>(X.data()), one, \ reinterpret_cast*>(A.data()), LDA); \ } else { \ - throw std::runtime_error( \ - "Error: blasZgerc() requires LayoutLeft views."); \ + /* blasgerc() + ~A_ll => call kokkos-kernels' implementation */ \ + GER::ger(space, trans, alpha, X, Y, A); \ } \ } \ Kokkos::Profiling::popRegion(); \ @@ -218,9 +217,7 @@ namespace Impl { Kokkos::MemoryTraits> \ AViewType; \ \ - static void ger(const EXEC_SPACE& /* space */ \ - , \ - const char trans[], \ + static void ger(const EXEC_SPACE& space, const char trans[], \ typename AViewType::const_value_type& alpha, \ const XViewType& X, const YViewType& Y, \ const AViewType& A) { \ @@ -252,8 +249,9 @@ namespace Impl { reinterpret_cast*>(X.data()), one, \ reinterpret_cast*>(A.data()), LDA); \ } else { \ - throw std::runtime_error( \ - "Error: blasCgerc() requires LayoutLeft views."); \ + /* blasgerc() + ~A_ll => call kokkos-kernels' implementation */ \ + GER::ger(space, trans, alpha, X, Y, A); \ } \ } \ Kokkos::Profiling::popRegion(); \ diff --git a/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_cublas.hpp b/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_cublas.hpp index d05b09784e..3f80144f62 100644 --- a/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_cublas.hpp +++ b/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_cublas.hpp @@ -196,8 +196,9 @@ namespace Impl { reinterpret_cast(X.data()), one, \ reinterpret_cast(A.data()), LDA)); \ } else { \ - throw std::runtime_error( \ - "Error: cublasZgerc() requires LayoutLeft views."); \ + /* cublasZgerc() + ~A_ll => call kokkos-kernels' implementation */ \ + GER::ger(space, trans, alpha, X, Y, A); \ } \ } \ KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ @@ -266,8 +267,9 @@ namespace Impl { reinterpret_cast(X.data()), one, \ reinterpret_cast(A.data()), LDA)); \ } else { \ - throw 
std::runtime_error( \ - "Error: cublasCgerc() requires LayoutLeft views."); \ + /* cublasCgerc() + ~A_ll => call kokkos-kernels' implementation */ \ + GER::ger(space, trans, alpha, X, Y, A); \ } \ } \ KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ diff --git a/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_rocblas.hpp b/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_rocblas.hpp index c55d091516..c21b61befa 100644 --- a/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_rocblas.hpp +++ b/blas/tpls/KokkosBlas2_ger_tpl_spec_decl_rocblas.hpp @@ -199,8 +199,9 @@ namespace Impl { reinterpret_cast(X.data()), one, \ reinterpret_cast(A.data()), LDA)); \ } else { \ - throw std::runtime_error( \ - "Error: rocblasZgerc() requires LayoutLeft views."); \ + /* rocblas_zgerc() + ~A_ll => call k-kernels' implementation */ \ + GER::ger(space, trans, alpha, X, Y, A); \ } \ } \ KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ @@ -273,8 +274,9 @@ namespace Impl { reinterpret_cast(X.data()), one, \ reinterpret_cast(A.data()), LDA)); \ } else { \ - throw std::runtime_error( \ - "Error: rocblasCgec() requires LayoutLeft views."); \ + /* rocblas_cgerc() + ~A_ll => call k-kernels' implementation */ \ + GER::ger(space, trans, alpha, X, Y, A); \ } \ } \ KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ diff --git a/blas/tpls/KokkosBlas2_syr2_tpl_spec_avail.hpp b/blas/tpls/KokkosBlas2_syr2_tpl_spec_avail.hpp new file mode 100644 index 0000000000..59fb154d35 --- /dev/null +++ b/blas/tpls/KokkosBlas2_syr2_tpl_spec_avail.hpp @@ -0,0 +1,205 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. 
+// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_HPP_ +#define KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_HPP_ + +namespace KokkosBlas { +namespace Impl { +// Specialization struct which defines whether a specialization exists +template +struct syr2_tpl_spec_avail { + enum : bool { value = false }; +}; + +// Generic Host side BLAS (could be MKL or whatever) +#ifdef KOKKOSKERNELS_ENABLE_TPL_BLAS + +#define KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(SCALAR, LAYOUT, EXEC_SPACE, \ + MEM_SPACE) \ + template <> \ + struct syr2_tpl_spec_avail< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ + }; + +#ifdef KOKKOS_ENABLE_SERIAL +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(double, Kokkos::LayoutLeft, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(float, Kokkos::LayoutLeft, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Serial, Kokkos::HostSpace) + +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(double, Kokkos::LayoutRight, + Kokkos::Serial, Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(float, Kokkos::LayoutRight, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::Serial, + Kokkos::HostSpace) +#endif + +#ifdef KOKKOS_ENABLE_OPENMP +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(double, Kokkos::LayoutLeft, Kokkos::OpenMP, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(float, Kokkos::LayoutLeft, Kokkos::OpenMP, + Kokkos::HostSpace) 
+KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::OpenMP, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::OpenMP, Kokkos::HostSpace) + +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(double, Kokkos::LayoutRight, + Kokkos::OpenMP, Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(float, Kokkos::LayoutRight, Kokkos::OpenMP, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::OpenMP, + Kokkos::HostSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_BLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::OpenMP, + Kokkos::HostSpace) +#endif + +#endif + +// cuBLAS +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUBLAS + +#define KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(SCALAR, LAYOUT, EXEC_SPACE, \ + MEM_SPACE) \ + template <> \ + struct syr2_tpl_spec_avail< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ + }; + +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(double, Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(float, Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaSpace) + +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(double, Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaUVMSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(float, Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaUVMSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaUVMSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaUVMSpace) + 
+KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(double, Kokkos::LayoutRight, + Kokkos::Cuda, Kokkos::CudaSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(float, Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaSpace) + +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(double, Kokkos::LayoutRight, + Kokkos::Cuda, Kokkos::CudaUVMSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(float, Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_CUBLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace) + +#endif + +// rocBLAS +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCBLAS + +#define KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(SCALAR, LAYOUT, EXEC_SPACE, \ + MEM_SPACE) \ + template <> \ + struct syr2_tpl_spec_avail< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ + }; + +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(double, Kokkos::LayoutLeft, Kokkos::HIP, + Kokkos::HIPSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(float, Kokkos::LayoutLeft, Kokkos::HIP, + Kokkos::HIPSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::HIP, + Kokkos::HIPSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::HIP, + Kokkos::HIPSpace) + +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(double, Kokkos::LayoutRight, + Kokkos::HIP, Kokkos::HIPSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(float, Kokkos::LayoutRight, Kokkos::HIP, + Kokkos::HIPSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(Kokkos::complex, + 
Kokkos::LayoutRight, Kokkos::HIP, + Kokkos::HIPSpace) +KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_ROCBLAS(Kokkos::complex, + Kokkos::LayoutRight, Kokkos::HIP, + Kokkos::HIPSpace) + +#endif +} // namespace Impl +} // namespace KokkosBlas + +#endif // KOKKOSBLAS2_SYR2_TPL_SPEC_AVAIL_HPP_ diff --git a/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl.hpp b/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl.hpp new file mode 100644 index 0000000000..66ba81b685 --- /dev/null +++ b/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl.hpp @@ -0,0 +1,35 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS2_SYR2_TPL_SPEC_DECL_HPP_ +#define KOKKOSBLAS2_SYR2_TPL_SPEC_DECL_HPP_ + +// BLAS +#ifdef KOKKOSKERNELS_ENABLE_TPL_BLAS +#include <KokkosBlas2_syr2_tpl_spec_decl_blas.hpp> +#endif + +// cuBLAS +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUBLAS +#include <KokkosBlas2_syr2_tpl_spec_decl_cublas.hpp> +#endif + +// rocBLAS +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCBLAS +#include <KokkosBlas2_syr2_tpl_spec_decl_rocblas.hpp> +#endif + +#endif diff --git a/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_blas.hpp b/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_blas.hpp new file mode 100644 index 0000000000..8561675c72 --- /dev/null +++ b/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_blas.hpp @@ -0,0 +1,317 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software.
+// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS2_SYR2_TPL_SPEC_DECL_BLAS_HPP_ +#define KOKKOSBLAS2_SYR2_TPL_SPEC_DECL_BLAS_HPP_ + +#include "KokkosBlas_Host_tpl.hpp" + +namespace KokkosBlas { +namespace Impl { + +#define KOKKOSBLAS2_SYR2_DETERMINE_ARGS(LAYOUT) \ + bool A_is_ll = std::is_same::value; \ + bool A_is_lr = std::is_same::value; \ + const int N = static_cast(A_is_lr ? A.extent(0) : A.extent(1)); \ + constexpr int one = 1; \ + const int LDA = A_is_lr ? A.stride(0) : A.stride(1); + +#define KOKKOSBLAS2_DSYR2_BLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, ETI_SPEC_AVAIL> { \ + typedef double SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::syr2[TPL_BLAS,double]"); \ + KOKKOSBLAS2_SYR2_DETERMINE_ARGS(LAYOUT); \ + if (A_is_ll) { \ + HostBlas::syr2(uplo[0], N, alpha, X.data(), one, Y.data(), \ + one, A.data(), LDA); \ + } else { \ + /* blasDsyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_SSYR2_BLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2< \ + EXEC_SPACE, \ + Kokkos::View, \ 
+ Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, ETI_SPEC_AVAIL> { \ + typedef float SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::syr2[TPL_BLAS,float]"); \ + KOKKOSBLAS2_SYR2_DETERMINE_ARGS(LAYOUT); \ + if (A_is_ll) { \ + HostBlas::syr2(uplo[0], N, alpha, X.data(), one, Y.data(), \ + one, A.data(), LDA); \ + } else { \ + /* blasSsyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_ZSYR2_BLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View**, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + true, ETI_SPEC_AVAIL> { \ + typedef Kokkos::complex SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosBlas::syr2[TPL_BLAS,complex]"); \ + KOKKOSBLAS2_SYR2_DETERMINE_ARGS(LAYOUT); \ + bool justTranspose = (trans[0] ==
'T') || (trans[0] == 't'); \ + if (justTranspose) { \ + /* No blasZsyr2() => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } else { \ + if (A_is_ll) { \ + HostBlas>::zher2( \ + uplo[0], N, alpha, \ + reinterpret_cast*>(X.data()), one, \ + reinterpret_cast*>(Y.data()), one, \ + reinterpret_cast*>(A.data()), LDA); \ + } else { \ + /* blasZher2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_CSYR2_BLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View**, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + true, ETI_SPEC_AVAIL> { \ + typedef Kokkos::complex SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits> \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosBlas::syr2[TPL_BLAS,complex]"); \ + KOKKOSBLAS2_SYR2_DETERMINE_ARGS(LAYOUT); \ + bool justTranspose = (trans[0] == 'T') || (trans[0] == 't'); \ + if (justTranspose) { \ + /* No blasCsyr2() => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } else { \ + if (A_is_ll) { \ + HostBlas>::cher2( \ + uplo[0], N, alpha, \ + reinterpret_cast*>(X.data()), one, \ + reinterpret_cast*>(Y.data()), one, \ + reinterpret_cast*>(A.data()), LDA); \ + } else { \ + /* blasCher2() + ~A_ll => call kokkos-kernels' implementation */ \ +
SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#ifdef KOKKOS_ENABLE_SERIAL +KOKKOSBLAS2_DSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::HostSpace, + true) +KOKKOSBLAS2_DSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::HostSpace, + false) +KOKKOSBLAS2_DSYR2_BLAS(Kokkos::LayoutRight, Kokkos::Serial, Kokkos::HostSpace, + true) +KOKKOSBLAS2_DSYR2_BLAS(Kokkos::LayoutRight, Kokkos::Serial, Kokkos::HostSpace, + false) + +KOKKOSBLAS2_SSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::HostSpace, + true) +KOKKOSBLAS2_SSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::HostSpace, + false) +KOKKOSBLAS2_SSYR2_BLAS(Kokkos::LayoutRight, Kokkos::Serial, Kokkos::HostSpace, + true) +KOKKOSBLAS2_SSYR2_BLAS(Kokkos::LayoutRight, Kokkos::Serial, Kokkos::HostSpace, + false) + +KOKKOSBLAS2_ZSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::HostSpace, + true) +KOKKOSBLAS2_ZSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::HostSpace, + false) +KOKKOSBLAS2_ZSYR2_BLAS(Kokkos::LayoutRight, Kokkos::Serial, Kokkos::HostSpace, + true) +KOKKOSBLAS2_ZSYR2_BLAS(Kokkos::LayoutRight, Kokkos::Serial, Kokkos::HostSpace, + false) + +KOKKOSBLAS2_CSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::HostSpace, + true) +KOKKOSBLAS2_CSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::HostSpace, + false) +KOKKOSBLAS2_CSYR2_BLAS(Kokkos::LayoutRight, Kokkos::Serial, Kokkos::HostSpace, + true) +KOKKOSBLAS2_CSYR2_BLAS(Kokkos::LayoutRight, Kokkos::Serial, Kokkos::HostSpace, + false) +#endif + +#ifdef KOKKOS_ENABLE_OPENMP +KOKKOSBLAS2_DSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::HostSpace, + true) +KOKKOSBLAS2_DSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::HostSpace, + false) +KOKKOSBLAS2_DSYR2_BLAS(Kokkos::LayoutRight, Kokkos::OpenMP, Kokkos::HostSpace, + true) +KOKKOSBLAS2_DSYR2_BLAS(Kokkos::LayoutRight, Kokkos::OpenMP, Kokkos::HostSpace, + false) + +KOKKOSBLAS2_SSYR2_BLAS(Kokkos::LayoutLeft, 
Kokkos::OpenMP, Kokkos::HostSpace, + true) +KOKKOSBLAS2_SSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::HostSpace, + false) +KOKKOSBLAS2_SSYR2_BLAS(Kokkos::LayoutRight, Kokkos::OpenMP, Kokkos::HostSpace, + true) +KOKKOSBLAS2_SSYR2_BLAS(Kokkos::LayoutRight, Kokkos::OpenMP, Kokkos::HostSpace, + false) + +KOKKOSBLAS2_ZSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::HostSpace, + true) +KOKKOSBLAS2_ZSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::HostSpace, + false) +KOKKOSBLAS2_ZSYR2_BLAS(Kokkos::LayoutRight, Kokkos::OpenMP, Kokkos::HostSpace, + true) +KOKKOSBLAS2_ZSYR2_BLAS(Kokkos::LayoutRight, Kokkos::OpenMP, Kokkos::HostSpace, + false) + +KOKKOSBLAS2_CSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::HostSpace, + true) +KOKKOSBLAS2_CSYR2_BLAS(Kokkos::LayoutLeft, Kokkos::OpenMP, Kokkos::HostSpace, + false) +KOKKOSBLAS2_CSYR2_BLAS(Kokkos::LayoutRight, Kokkos::OpenMP, Kokkos::HostSpace, + true) +KOKKOSBLAS2_CSYR2_BLAS(Kokkos::LayoutRight, Kokkos::OpenMP, Kokkos::HostSpace, + false) +#endif + +} // namespace Impl +} // namespace KokkosBlas + +#endif diff --git a/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_cublas.hpp b/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_cublas.hpp new file mode 100644 index 0000000000..ca98fedf0d --- /dev/null +++ b/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_cublas.hpp @@ -0,0 +1,372 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS2_SYR2_TPL_SPEC_DECL_CUBLAS_HPP_ +#define KOKKOSBLAS2_SYR2_TPL_SPEC_DECL_CUBLAS_HPP_ + +#include + +namespace KokkosBlas { +namespace Impl { + +#define KOKKOSBLAS2_SYR2_CUBLAS_DETERMINE_ARGS(LAYOUT, uploChar) \ + bool A_is_ll = std::is_same::value; \ + bool A_is_lr = std::is_same::value; \ + const int N = static_cast(A_is_lr ? A.extent(0) : A.extent(1)); \ + constexpr int one = 1; \ + const int LDA = A_is_lr ? A.stride(0) : A.stride(1); \ + cublasFillMode_t fillMode = (uploChar == 'L' || uploChar == 'l') \ + ? CUBLAS_FILL_MODE_LOWER \ + : CUBLAS_FILL_MODE_UPPER; + +#define KOKKOSBLAS2_DSYR2_CUBLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, \ + ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + typedef double SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::syr2[TPL_CUBLAS,double]"); \ + KOKKOSBLAS2_SYR2_CUBLAS_DETERMINE_ARGS(LAYOUT, uplo[0]); \ + if (A_is_ll) { \ + KokkosBlas::Impl::CudaBlasSingleton& s = \ + KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasSetStream(s.handle, space.cuda_stream())); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasDsyr2(s.handle, fillMode, N, &alpha, X.data(), one, \ + Y.data(), one, A.data(), LDA)); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ + } else { \ + /* 
cublasDsyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_SSYR2_CUBLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, \ + ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + typedef float SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::syr2[TPL_CUBLAS,float]"); \ + KOKKOSBLAS2_SYR2_CUBLAS_DETERMINE_ARGS(LAYOUT, uplo[0]); \ + if (A_is_ll) { \ + KokkosBlas::Impl::CudaBlasSingleton& s = \ + KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasSetStream(s.handle, space.cuda_stream())); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasSsyr2(s.handle, fillMode, N, &alpha, X.data(), one, \ + Y.data(), one, A.data(), LDA)); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ + } else { \ + /* cublasSsyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_ZSYR2_CUBLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, \ + ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View**, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ 
+ true, ETI_SPEC_AVAIL> { \ + typedef Kokkos::complex SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosBlas::syr2[TPL_CUBLAS,complex]"); \ + KOKKOSBLAS2_SYR2_CUBLAS_DETERMINE_ARGS(LAYOUT, uplo[0]); \ + bool justTranspose = (trans[0] == 'T') || (trans[0] == 't'); \ + if (justTranspose) { \ + if (A_is_ll) { \ + KokkosBlas::Impl::CudaBlasSingleton& s = \ + KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasSetStream(s.handle, space.cuda_stream())); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasZsyr2( \ + s.handle, fillMode, N, \ + reinterpret_cast(&alpha), \ + reinterpret_cast(X.data()), one, \ + reinterpret_cast(Y.data()), one, \ + reinterpret_cast(A.data()), LDA)); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ + } else { \ + /* cublasZsyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } else { \ + if (A_is_ll) { \ + KokkosBlas::Impl::CudaBlasSingleton& s = \ + KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasSetStream(s.handle, space.cuda_stream())); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasZher2( \ + s.handle, fillMode, N, \ + reinterpret_cast(&alpha), \ + reinterpret_cast(X.data()), one, \ + reinterpret_cast(Y.data()), one, \ + reinterpret_cast(A.data()), LDA)); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ + } else { \ + /* cublasZher2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, 
A); \ + } \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_CSYR2_CUBLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, \ + ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View**, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + typedef Kokkos::complex SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosBlas::syr2[TPL_CUBLAS,complex]"); \ + KOKKOSBLAS2_SYR2_CUBLAS_DETERMINE_ARGS(LAYOUT, uplo[0]); \ + bool justTranspose = (trans[0] == 'T') || (trans[0] == 't'); \ + if (justTranspose) { \ + if (A_is_ll) { \ + KokkosBlas::Impl::CudaBlasSingleton& s = \ + KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasSetStream(s.handle, space.cuda_stream())); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasCsyr2(s.handle, fillMode, N, \ + reinterpret_cast(&alpha), \ + reinterpret_cast(X.data()), one, \ + reinterpret_cast(Y.data()), one, \ + reinterpret_cast(A.data()), LDA)); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ + } else { \ + /* cublasCsyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } else { \ + if (A_is_ll) { \ + KokkosBlas::Impl::CudaBlasSingleton& s = \ + KokkosBlas::Impl::CudaBlasSingleton::singleton(); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasSetStream(s.handle, space.cuda_stream())); \ + 
KOKKOS_CUBLAS_SAFE_CALL_IMPL( \ + cublasCher2(s.handle, fillMode, N, \ + reinterpret_cast(&alpha), \ + reinterpret_cast(X.data()), one, \ + reinterpret_cast(Y.data()), one, \ + reinterpret_cast(A.data()), LDA)); \ + KOKKOS_CUBLAS_SAFE_CALL_IMPL(cublasSetStream(s.handle, NULL)); \ + } else { \ + /* cublasCher2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +KOKKOSBLAS2_DSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, + true) +KOKKOSBLAS2_DSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, + false) +KOKKOSBLAS2_DSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, Kokkos::CudaSpace, + true) +KOKKOSBLAS2_DSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, Kokkos::CudaSpace, + false) + +KOKKOSBLAS2_DSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaUVMSpace, + true) +KOKKOSBLAS2_DSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaUVMSpace, + false) +KOKKOSBLAS2_DSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace, true) +KOKKOSBLAS2_DSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace, false) + +KOKKOSBLAS2_SSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, + true) +KOKKOSBLAS2_SSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, + false) +KOKKOSBLAS2_SSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, Kokkos::CudaSpace, + true) +KOKKOSBLAS2_SSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, Kokkos::CudaSpace, + false) + +KOKKOSBLAS2_SSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaUVMSpace, + true) +KOKKOSBLAS2_SSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaUVMSpace, + false) +KOKKOSBLAS2_SSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace, true) +KOKKOSBLAS2_SSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace, false) + +KOKKOSBLAS2_ZSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, + 
true) +KOKKOSBLAS2_ZSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, + false) +KOKKOSBLAS2_ZSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, Kokkos::CudaSpace, + true) +KOKKOSBLAS2_ZSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, Kokkos::CudaSpace, + false) + +KOKKOSBLAS2_ZSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaUVMSpace, + true) +KOKKOSBLAS2_ZSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaUVMSpace, + false) +KOKKOSBLAS2_ZSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace, true) +KOKKOSBLAS2_ZSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace, false) + +KOKKOSBLAS2_CSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, + true) +KOKKOSBLAS2_CSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, + false) +KOKKOSBLAS2_CSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, Kokkos::CudaSpace, + true) +KOKKOSBLAS2_CSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, Kokkos::CudaSpace, + false) + +KOKKOSBLAS2_CSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaUVMSpace, + true) +KOKKOSBLAS2_CSYR2_CUBLAS(Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaUVMSpace, + false) +KOKKOSBLAS2_CSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace, true) +KOKKOSBLAS2_CSYR2_CUBLAS(Kokkos::LayoutRight, Kokkos::Cuda, + Kokkos::CudaUVMSpace, false) + +} // namespace Impl +} // namespace KokkosBlas + +#endif diff --git a/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_rocblas.hpp b/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_rocblas.hpp new file mode 100644 index 0000000000..869c065af2 --- /dev/null +++ b/blas/tpls/KokkosBlas2_syr2_tpl_spec_decl_rocblas.hpp @@ -0,0 +1,336 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. 
Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS2_SYR2_TPL_SPEC_DECL_ROCBLAS_HPP_ +#define KOKKOSBLAS2_SYR2_TPL_SPEC_DECL_ROCBLAS_HPP_ + +#include + +namespace KokkosBlas { +namespace Impl { + +#define KOKKOSBLAS2_SYR2_ROCBLAS_DETERMINE_ARGS(LAYOUT, uploChar) \ + bool A_is_ll = std::is_same::value; \ + bool A_is_lr = std::is_same::value; \ + const int N = static_cast(A_is_lr ? A.extent(0) : A.extent(1)); \ + constexpr int one = 1; \ + const int LDA = A_is_lr ? A.stride(0) : A.stride(1); \ + rocblas_fill fillMode = (uploChar == 'L' || uploChar == 'l') \ + ? rocblas_fill_lower \ + : rocblas_fill_upper; + +#define KOKKOSBLAS2_DSYR2_ROCBLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, \ + ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + typedef double SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::syr2[TPL_ROCBLAS,double]"); \ + KOKKOSBLAS2_SYR2_ROCBLAS_DETERMINE_ARGS(LAYOUT, uplo[0]); \ + if (A_is_ll) { \ + KokkosBlas::Impl::RocBlasSingleton& s = \ + KokkosBlas::Impl::RocBlasSingleton::singleton(); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + rocblas_set_stream(s.handle, space.hip_stream())); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + 
rocblas_dsyr2(s.handle, fillMode, N, &alpha, X.data(), one, \ + Y.data(), one, A.data(), LDA)); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ + } else { \ + /* rocblas_dsyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_SSYR2_ROCBLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, \ + ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + typedef float SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion("KokkosBlas::syr2[TPL_ROCBLAS,float]"); \ + KOKKOSBLAS2_SYR2_ROCBLAS_DETERMINE_ARGS(LAYOUT, uplo[0]); \ + if (A_is_ll) { \ + KokkosBlas::Impl::RocBlasSingleton& s = \ + KokkosBlas::Impl::RocBlasSingleton::singleton(); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + rocblas_set_stream(s.handle, space.hip_stream())); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + rocblas_ssyr2(s.handle, fillMode, N, &alpha, X.data(), one, \ + Y.data(), one, A.data(), LDA)); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ + } else { \ + /* rocblas_ssyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_ZSYR2_ROCBLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, \ + ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2*, 
LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View**, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + typedef Kokkos::complex SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosBlas::syr2[TPL_ROCBLAS,complex]"); \ + KOKKOSBLAS2_SYR2_ROCBLAS_DETERMINE_ARGS(LAYOUT, uplo[0]); \ + bool justTranspose = (trans[0] == 'T') || (trans[0] == 't'); \ + if (justTranspose) { \ + if (A_is_ll) { \ + KokkosBlas::Impl::RocBlasSingleton& s = \ + KokkosBlas::Impl::RocBlasSingleton::singleton(); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + rocblas_set_stream(s.handle, space.hip_stream())); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_zsyr2( \ + s.handle, fillMode, N, \ + reinterpret_cast(&alpha), \ + reinterpret_cast(X.data()), one, \ + reinterpret_cast(Y.data()), one, \ + reinterpret_cast(A.data()), LDA)); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ + } else { \ + /* rocblas_zsyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } else { \ + if (A_is_ll && (alpha.imag() == 0.)) { \ + KokkosBlas::Impl::RocBlasSingleton& s = \ + KokkosBlas::Impl::RocBlasSingleton::singleton(); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + rocblas_set_stream(s.handle, space.hip_stream())); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_zher2( \ + s.handle, fillMode, N, \ + reinterpret_cast(&alpha), \ + reinterpret_cast(X.data()), one, \ + 
reinterpret_cast(Y.data()), one, \ + reinterpret_cast(A.data()), LDA)); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ + } else { \ + /* rocblas_zher2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSBLAS2_CSYR2_ROCBLAS(LAYOUT, EXEC_SPACE, MEM_SPACE, \ + ETI_SPEC_AVAIL) \ + template <> \ + struct SYR2*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View**, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + typedef Kokkos::complex SCALAR; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + XViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + YViewType; \ + typedef Kokkos::View, \ + Kokkos::MemoryTraits > \ + AViewType; \ + \ + static void syr2(const typename AViewType::execution_space& space, \ + const char trans[], const char uplo[], \ + typename AViewType::const_value_type& alpha, \ + const XViewType& X, const YViewType& Y, \ + const AViewType& A) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosBlas::syr2[TPL_ROCBLAS,complex]"); \ + KOKKOSBLAS2_SYR2_ROCBLAS_DETERMINE_ARGS(LAYOUT, uplo[0]); \ + bool justTranspose = (trans[0] == 'T') || (trans[0] == 't'); \ + if (justTranspose) { \ + if (A_is_ll) { \ + KokkosBlas::Impl::RocBlasSingleton& s = \ + KokkosBlas::Impl::RocBlasSingleton::singleton(); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + rocblas_set_stream(s.handle, space.hip_stream())); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_csyr2( \ + s.handle, fillMode, N, \ + reinterpret_cast(&alpha), \ + reinterpret_cast(X.data()), one, \ + reinterpret_cast(Y.data()), one, \ + reinterpret_cast(A.data()), LDA)); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ + } else { \ + /* rocblas_csyr2() + ~A_ll => call kokkos-kernels' implementation */ \ + 
SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } else { \ + if (A_is_ll && (alpha.imag() == 0.)) { \ + KokkosBlas::Impl::RocBlasSingleton& s = \ + KokkosBlas::Impl::RocBlasSingleton::singleton(); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( \ + rocblas_set_stream(s.handle, space.hip_stream())); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_cher2( \ + s.handle, fillMode, N, \ + reinterpret_cast(&alpha), \ + reinterpret_cast(X.data()), one, \ + reinterpret_cast(Y.data()), one, \ + reinterpret_cast(A.data()), LDA)); \ + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); \ + } else { \ + /* rocblas_cher2() + ~A_ll => call kokkos-kernels' implementation */ \ + SYR2::syr2(space, trans, uplo, alpha, X, Y, A); \ + } \ + } \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +KOKKOSBLAS2_DSYR2_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, + true) +KOKKOSBLAS2_DSYR2_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, + false) +KOKKOSBLAS2_DSYR2_ROCBLAS(Kokkos::LayoutRight, Kokkos::HIP, Kokkos::HIPSpace, + true) +KOKKOSBLAS2_DSYR2_ROCBLAS(Kokkos::LayoutRight, Kokkos::HIP, Kokkos::HIPSpace, + false) + +KOKKOSBLAS2_SSYR2_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, + true) +KOKKOSBLAS2_SSYR2_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, + false) +KOKKOSBLAS2_SSYR2_ROCBLAS(Kokkos::LayoutRight, Kokkos::HIP, Kokkos::HIPSpace, + true) +KOKKOSBLAS2_SSYR2_ROCBLAS(Kokkos::LayoutRight, Kokkos::HIP, Kokkos::HIPSpace, + false) + +KOKKOSBLAS2_ZSYR2_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, + true) +KOKKOSBLAS2_ZSYR2_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, + false) +KOKKOSBLAS2_ZSYR2_ROCBLAS(Kokkos::LayoutRight, Kokkos::HIP, Kokkos::HIPSpace, + true) +KOKKOSBLAS2_ZSYR2_ROCBLAS(Kokkos::LayoutRight, Kokkos::HIP, Kokkos::HIPSpace, + false) + +KOKKOSBLAS2_CSYR2_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, + true) +KOKKOSBLAS2_CSYR2_ROCBLAS(Kokkos::LayoutLeft, Kokkos::HIP, 
Kokkos::HIPSpace, + false) +KOKKOSBLAS2_CSYR2_ROCBLAS(Kokkos::LayoutRight, Kokkos::HIP, Kokkos::HIPSpace, + true) +KOKKOSBLAS2_CSYR2_ROCBLAS(Kokkos::LayoutRight, Kokkos::HIP, Kokkos::HIPSpace, + false) + +} // namespace Impl +} // namespace KokkosBlas + +#endif diff --git a/blas/tpls/KokkosBlas3_gemm_tpl_spec_decl.hpp b/blas/tpls/KokkosBlas3_gemm_tpl_spec_decl.hpp index 66177e28a6..68bf2708ec 100644 --- a/blas/tpls/KokkosBlas3_gemm_tpl_spec_decl.hpp +++ b/blas/tpls/KokkosBlas3_gemm_tpl_spec_decl.hpp @@ -60,20 +60,20 @@ namespace Impl { Kokkos::Profiling::pushRegion("KokkosBlas::gemm[TPL_BLAS," #SCALAR_TYPE \ "]"); \ const bool A_t = (transA[0] != 'N') && (transA[0] != 'n'); \ - const int M = C.extent(0); \ - const int N = C.extent(1); \ - const int K = A.extent(A_t ? 0 : 1); \ + const KK_INT M = C.extent(0); \ + const KK_INT N = C.extent(1); \ + const KK_INT K = A.extent(A_t ? 0 : 1); \ \ bool A_is_lr = std::is_same::value; \ bool B_is_lr = std::is_same::value; \ bool C_is_lr = std::is_same::value; \ \ - const int AST = A_is_lr ? A.stride(0) : A.stride(1), \ - LDA = AST == 0 ? 1 : AST; \ - const int BST = B_is_lr ? B.stride(0) : B.stride(1), \ - LDB = BST == 0 ? 1 : BST; \ - const int CST = C_is_lr ? C.stride(0) : C.stride(1), \ - LDC = CST == 0 ? 1 : CST; \ + const KK_INT AST = A_is_lr ? A.stride(0) : A.stride(1), \ + LDA = AST == 0 ? 1 : AST; \ + const KK_INT BST = B_is_lr ? B.stride(0) : B.stride(1), \ + LDB = BST == 0 ? 1 : BST; \ + const KK_INT CST = C_is_lr ? C.stride(0) : C.stride(1), \ + LDC = CST == 0 ? 
1 : CST; \ \ const BASE_SCALAR_TYPE alpha_val = alpha, beta_val = beta; \ if (!A_is_lr && !B_is_lr && !C_is_lr) \ diff --git a/blas/tpls/KokkosBlas_Host_tpl.cpp b/blas/tpls/KokkosBlas_Host_tpl.cpp index 6b158f4d19..50aab57c73 100644 --- a/blas/tpls/KokkosBlas_Host_tpl.cpp +++ b/blas/tpls/KokkosBlas_Host_tpl.cpp @@ -22,140 +22,162 @@ #if defined(KOKKOSKERNELS_ENABLE_TPL_BLAS) +using KokkosBlas::Impl::KK_INT; + /// Fortran headers extern "C" { /// /// scal /// -void F77_BLAS_MANGLE(sscal, SSCAL)(const int* N, const float* alpha, - /* */ float* x, const int* x_inc); -void F77_BLAS_MANGLE(dscal, DSCAL)(const int* N, const double* alpha, - /* */ double* x, const int* x_inc); +void F77_BLAS_MANGLE(sscal, SSCAL)(const KK_INT* N, const float* alpha, + /* */ float* x, const KK_INT* x_inc); +void F77_BLAS_MANGLE(dscal, DSCAL)(const KK_INT* N, const double* alpha, + /* */ double* x, const KK_INT* x_inc); void F77_BLAS_MANGLE(cscal, - CSCAL)(const int* N, const std::complex* alpha, - /* */ std::complex* x, const int* x_inc); + CSCAL)(const KK_INT* N, const std::complex* alpha, + /* */ std::complex* x, const KK_INT* x_inc); void F77_BLAS_MANGLE(zscal, - ZSCAL)(const int* N, const std::complex* alpha, - /* */ std::complex* x, const int* x_inc); + ZSCAL)(const KK_INT* N, const std::complex* alpha, + /* */ std::complex* x, const KK_INT* x_inc); /// /// max /// -int F77_BLAS_MANGLE(isamax, ISAMAX)(const int* N, const float* x, - const int* x_inc); -int F77_BLAS_MANGLE(idamax, IDAMAX)(const int* N, const double* x, - const int* x_inc); -int F77_BLAS_MANGLE(icamax, ICAMAX)(const int* N, const std::complex* x, - const int* x_inc); -int F77_BLAS_MANGLE(izamax, IZAMAX)(const int* N, const std::complex* x, - const int* x_inc); +KK_INT F77_BLAS_MANGLE(isamax, ISAMAX)(const KK_INT* N, const float* x, + const KK_INT* x_inc); +KK_INT F77_BLAS_MANGLE(idamax, IDAMAX)(const KK_INT* N, const double* x, + const KK_INT* x_inc); +KK_INT F77_BLAS_MANGLE(icamax, ICAMAX)(const KK_INT* N, + const 
std::complex* x, + const KK_INT* x_inc); +KK_INT F77_BLAS_MANGLE(izamax, IZAMAX)(const KK_INT* N, + const std::complex* x, + const KK_INT* x_inc); /// /// nrm2 /// -float F77_BLAS_MANGLE(snrm2, SNRM2)(const int* N, const float* x, - const int* x_inc); -double F77_BLAS_MANGLE(dnrm2, DNRM2)(const int* N, const double* x, - const int* x_inc); -float F77_BLAS_MANGLE(scnrm2, SCNRM2)(const int* N, +float F77_BLAS_MANGLE(snrm2, SNRM2)(const KK_INT* N, const float* x, + const KK_INT* x_inc); +double F77_BLAS_MANGLE(dnrm2, DNRM2)(const KK_INT* N, const double* x, + const KK_INT* x_inc); +float F77_BLAS_MANGLE(scnrm2, SCNRM2)(const KK_INT* N, const std::complex* x, - const int* x_inc); -double F77_BLAS_MANGLE(dznrm2, DZNRM2)(const int* N, + const KK_INT* x_inc); +double F77_BLAS_MANGLE(dznrm2, DZNRM2)(const KK_INT* N, const std::complex* x, - const int* x_inc); + const KK_INT* x_inc); /// /// sum /// -float F77_BLAS_MANGLE(sasum, SASUM)(const int* N, const float* x, - const int* x_inc); -double F77_BLAS_MANGLE(dasum, DASUM)(const int* N, const double* x, - const int* x_inc); -float F77_BLAS_MANGLE(scasum, SCASUM)(const int* N, +float F77_BLAS_MANGLE(sasum, SASUM)(const KK_INT* N, const float* x, + const KK_INT* x_inc); +double F77_BLAS_MANGLE(dasum, DASUM)(const KK_INT* N, const double* x, + const KK_INT* x_inc); +float F77_BLAS_MANGLE(scasum, SCASUM)(const KK_INT* N, const std::complex* x, - const int* x_inc); -double F77_BLAS_MANGLE(dzasum, DZASUM)(const int* N, + const KK_INT* x_inc); +double F77_BLAS_MANGLE(dzasum, DZASUM)(const KK_INT* N, const std::complex* x, - const int* x_inc); + const KK_INT* x_inc); /// /// dot /// -float F77_BLAS_MANGLE(sdot, SDOT)(const int* N, const float* x, - const int* x_inc, const float* y, - const int* y_inc); -double F77_BLAS_MANGLE(ddot, DDOT)(const int* N, const double* x, - const int* x_inc, const double* y, - const int* y_inc); +float F77_BLAS_MANGLE(sdot, SDOT)(const KK_INT* N, const float* x, + const KK_INT* x_inc, const float* y, + 
const KK_INT* y_inc); +double F77_BLAS_MANGLE(ddot, DDOT)(const KK_INT* N, const double* x, + const KK_INT* x_inc, const double* y, + const KK_INT* y_inc); #if defined(KOKKOSKERNELS_TPL_BLAS_RETURN_COMPLEX) -std::complex F77_BLAS_MANGLE(cdotu, CDOTU)(const int* N, - const std::complex* x, - const int* x_inc, - const std::complex* y, - const int* y_inc); -std::complex F77_BLAS_MANGLE(zdotu, ZDOTU)( - const int* N, const std::complex* x, const int* x_inc, - const std::complex* y, const int* y_inc); -std::complex F77_BLAS_MANGLE(cdotc, CDOTC)(const int* N, - const std::complex* x, - const int* x_inc, - const std::complex* y, - const int* y_inc); -std::complex F77_BLAS_MANGLE(zdotc, ZDOTC)( - const int* N, const std::complex* x, const int* x_inc, - const std::complex* y, const int* y_inc); +// clang-format off +// For the return type, don't use std::complex, otherwise compiler will complain +// error: 'cdotu_' has C-linkage specified, but returns user-defined type 'std::complex' which is incompatible with C [-Werror,-Wreturn-type-c-linkage]" +// But with float _Complex, I got error: '_Complex' is a C99 extension [-Werror,-Wc99-extensions]. +// So I just use a C struct. 
+// clang-format on +typedef struct { + float vals[2]; +} _kk_float2; +typedef struct { + double vals[2]; +} _kk_double2; + +_kk_float2 F77_BLAS_MANGLE(cdotu, CDOTU)(const KK_INT* N, + const std::complex* x, + const KK_INT* x_inc, + const std::complex* y, + const KK_INT* y_inc); +_kk_double2 F77_BLAS_MANGLE(zdotu, ZDOTU)(const KK_INT* N, + const std::complex* x, + const KK_INT* x_inc, + const std::complex* y, + const KK_INT* y_inc); +_kk_float2 F77_BLAS_MANGLE(cdotc, CDOTC)(const KK_INT* N, + const std::complex* x, + const KK_INT* x_inc, + const std::complex* y, + const KK_INT* y_inc); +_kk_double2 F77_BLAS_MANGLE(zdotc, ZDOTC)(const KK_INT* N, + const std::complex* x, + const KK_INT* x_inc, + const std::complex* y, + const KK_INT* y_inc); #else void F77_BLAS_MANGLE(cdotu, - CDOTU)(std::complex* res, const int* N, - const std::complex* x, const int* x_inc, - const std::complex* y, const int* y_inc); + CDOTU)(std::complex* res, const KK_INT* N, + const std::complex* x, const KK_INT* x_inc, + const std::complex* y, const KK_INT* y_inc); void F77_BLAS_MANGLE(zdotu, - ZDOTU)(std::complex* res, const int* N, - const std::complex* x, const int* x_inc, - const std::complex* y, const int* y_inc); + ZDOTU)(std::complex* res, const KK_INT* N, + const std::complex* x, const KK_INT* x_inc, + const std::complex* y, const KK_INT* y_inc); void F77_BLAS_MANGLE(cdotc, - CDOTC)(std::complex* res, const int* N, - const std::complex* x, const int* x_inc, - const std::complex* y, const int* y_inc); + CDOTC)(std::complex* res, const KK_INT* N, + const std::complex* x, const KK_INT* x_inc, + const std::complex* y, const KK_INT* y_inc); void F77_BLAS_MANGLE(zdotc, - ZDOTC)(std::complex* res, const int* N, - const std::complex* x, const int* x_inc, - const std::complex* y, const int* y_inc); + ZDOTC)(std::complex* res, const KK_INT* N, + const std::complex* x, const KK_INT* x_inc, + const std::complex* y, const KK_INT* y_inc); #endif /// /// axpy /// -void F77_BLAS_MANGLE(saxpy, 
SAXPY)(const int* N, const float* alpha, - const float* x, const int* x_inc, - /* */ float* y, const int* y_inc); -void F77_BLAS_MANGLE(daxpy, DAXPY)(const int* N, const double* alpha, - const double* x, const int* x_inc, - /* */ double* y, const int* y_inc); +void F77_BLAS_MANGLE(saxpy, SAXPY)(const KK_INT* N, const float* alpha, + const float* x, const KK_INT* x_inc, + /* */ float* y, const KK_INT* y_inc); +void F77_BLAS_MANGLE(daxpy, DAXPY)(const KK_INT* N, const double* alpha, + const double* x, const KK_INT* x_inc, + /* */ double* y, const KK_INT* y_inc); void F77_BLAS_MANGLE(caxpy, - CAXPY)(const int* N, const std::complex* alpha, - const std::complex* x, const int* x_inc, - /* */ std::complex* y, const int* y_inc); + CAXPY)(const KK_INT* N, const std::complex* alpha, + const std::complex* x, const KK_INT* x_inc, + /* */ std::complex* y, const KK_INT* y_inc); void F77_BLAS_MANGLE(zaxpy, - ZAXPY)(const int* N, const std::complex* alpha, - const std::complex* x, const int* x_inc, - /* */ std::complex* y, const int* y_inc); + ZAXPY)(const KK_INT* N, const std::complex* alpha, + const std::complex* x, const KK_INT* x_inc, + /* */ std::complex* y, const KK_INT* y_inc); /// /// rot /// -void F77_BLAS_MANGLE(srot, SROT)(int const* N, float* X, int const* incx, - float* Y, int const* incy, float* c, float* s); -void F77_BLAS_MANGLE(drot, DROT)(int const* N, double* X, int const* incx, - double* Y, int const* incy, double* c, +void F77_BLAS_MANGLE(srot, SROT)(KK_INT const* N, float* X, KK_INT const* incx, + float* Y, KK_INT const* incy, float* c, + float* s); +void F77_BLAS_MANGLE(drot, DROT)(KK_INT const* N, double* X, KK_INT const* incx, + double* Y, KK_INT const* incy, double* c, double* s); -void F77_BLAS_MANGLE(crot, CROT)(int const* N, std::complex* X, - int const* incx, std::complex* Y, - int const* incy, float* c, float* s); -void F77_BLAS_MANGLE(zrot, ZROT)(int const* N, std::complex* X, - int const* incx, std::complex* Y, - int const* incy, double* c, 
double* s); +void F77_BLAS_MANGLE(crot, CROT)(KK_INT const* N, std::complex* X, + KK_INT const* incx, std::complex* Y, + KK_INT const* incy, float* c, float* s); +void F77_BLAS_MANGLE(zrot, ZROT)(KK_INT const* N, std::complex* X, + KK_INT const* incx, std::complex* Y, + KK_INT const* incy, double* c, double* s); /// /// rotg @@ -172,12 +194,12 @@ void F77_BLAS_MANGLE(zrotg, ZROTG)(std::complex* a, /// /// rotm /// -void F77_BLAS_MANGLE(srotm, SROTM)(const int* n, float* X, const int* incx, - float* Y, const int* incy, - float const* param); -void F77_BLAS_MANGLE(drotm, DROTM)(const int* n, double* X, const int* incx, - double* Y, const int* incy, - double const* param); +void F77_BLAS_MANGLE(srotm, SROTM)(const KK_INT* n, float* X, + const KK_INT* incx, float* Y, + const KK_INT* incy, float const* param); +void F77_BLAS_MANGLE(drotm, DROTM)(const KK_INT* n, double* X, + const KK_INT* incx, double* Y, + const KK_INT* incy, double const* param); /// /// rotmg @@ -190,72 +212,78 @@ void F77_BLAS_MANGLE(drotmg, DROTMG)(double* d1, double* d2, double* x1, /// /// swap /// -void F77_BLAS_MANGLE(sswap, SSWAP)(int const* N, float* X, int const* incx, - float* Y, int const* incy); -void F77_BLAS_MANGLE(dswap, DSWAP)(int const* N, double* X, int const* incx, - double* Y, int const* incy); -void F77_BLAS_MANGLE(cswap, CSWAP)(int const* N, std::complex* X, - int const* incx, std::complex* Y, - int const* incy); -void F77_BLAS_MANGLE(zswap, ZSWAP)(int const* N, std::complex* X, - int const* incx, std::complex* Y, - int const* incy); +void F77_BLAS_MANGLE(sswap, SSWAP)(KK_INT const* N, float* X, + KK_INT const* incx, float* Y, + KK_INT const* incy); +void F77_BLAS_MANGLE(dswap, DSWAP)(KK_INT const* N, double* X, + KK_INT const* incx, double* Y, + KK_INT const* incy); +void F77_BLAS_MANGLE(cswap, CSWAP)(KK_INT const* N, std::complex* X, + KK_INT const* incx, std::complex* Y, + KK_INT const* incy); +void F77_BLAS_MANGLE(zswap, ZSWAP)(KK_INT const* N, std::complex* X, + KK_INT 
const* incx, std::complex* Y, + KK_INT const* incy); /// /// Gemv /// -void F77_BLAS_MANGLE(sgemv, SGEMV)(const char*, int*, int*, const float*, - const float*, int*, const float*, int*, +void F77_BLAS_MANGLE(sgemv, SGEMV)(const char*, KK_INT*, KK_INT*, const float*, + const float*, KK_INT*, const float*, KK_INT*, const float*, - /* */ float*, int*); -void F77_BLAS_MANGLE(dgemv, DGEMV)(const char*, int*, int*, const double*, - const double*, int*, const double*, int*, - const double*, - /* */ double*, int*); -void F77_BLAS_MANGLE(cgemv, CGEMV)(const char*, int*, int*, + /* */ float*, KK_INT*); +void F77_BLAS_MANGLE(dgemv, DGEMV)(const char*, KK_INT*, KK_INT*, const double*, + const double*, KK_INT*, const double*, + KK_INT*, const double*, + /* */ double*, KK_INT*); +void F77_BLAS_MANGLE(cgemv, CGEMV)(const char*, KK_INT*, KK_INT*, const std::complex*, - const std::complex*, int*, - const std::complex*, int*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, const std::complex*, - /* */ std::complex*, int*); -void F77_BLAS_MANGLE(zgemv, ZGEMV)(const char*, int*, int*, + /* */ std::complex*, KK_INT*); +void F77_BLAS_MANGLE(zgemv, ZGEMV)(const char*, KK_INT*, KK_INT*, const std::complex*, - const std::complex*, int*, - const std::complex*, int*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, const std::complex*, - /* */ std::complex*, int*); + /* */ std::complex*, KK_INT*); /// /// Ger /// -void F77_BLAS_MANGLE(sger, SGER)(int*, int*, const float*, const float*, int*, - const float*, int*, float*, int*); -void F77_BLAS_MANGLE(dger, DGER)(int*, int*, const double*, const double*, int*, - const double*, int*, double*, int*); -void F77_BLAS_MANGLE(cgeru, CGERU)(int*, int*, const std::complex*, - const std::complex*, int*, - const std::complex*, int*, - std::complex*, int*); -void F77_BLAS_MANGLE(zgeru, ZGERU)(int*, int*, const std::complex*, - const std::complex*, int*, - const std::complex*, int*, - std::complex*, int*); -void 
F77_BLAS_MANGLE(cgerc, CGERC)(int*, int*, const std::complex*, - const std::complex*, int*, - const std::complex*, int*, - std::complex*, int*); -void F77_BLAS_MANGLE(zgerc, ZGERC)(int*, int*, const std::complex*, - const std::complex*, int*, - const std::complex*, int*, - std::complex*, int*); +void F77_BLAS_MANGLE(sger, SGER)(KK_INT*, KK_INT*, const float*, const float*, + KK_INT*, const float*, KK_INT*, float*, + KK_INT*); +void F77_BLAS_MANGLE(dger, DGER)(KK_INT*, KK_INT*, const double*, const double*, + KK_INT*, const double*, KK_INT*, double*, + KK_INT*); +void F77_BLAS_MANGLE(cgeru, CGERU)(KK_INT*, KK_INT*, const std::complex*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, + std::complex*, KK_INT*); +void F77_BLAS_MANGLE(zgeru, ZGERU)(KK_INT*, KK_INT*, + const std::complex*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, + std::complex*, KK_INT*); +void F77_BLAS_MANGLE(cgerc, CGERC)(KK_INT*, KK_INT*, const std::complex*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, + std::complex*, KK_INT*); +void F77_BLAS_MANGLE(zgerc, ZGERC)(KK_INT*, KK_INT*, + const std::complex*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, + std::complex*, KK_INT*); /// /// Syr /// -void F77_BLAS_MANGLE(ssyr, SSYR)(const char*, int*, const float*, const float*, - int*, float*, int*); -void F77_BLAS_MANGLE(dsyr, DSYR)(const char*, int*, const double*, - const double*, int*, double*, int*); +void F77_BLAS_MANGLE(ssyr, SSYR)(const char*, KK_INT*, const float*, + const float*, KK_INT*, float*, KK_INT*); +void F77_BLAS_MANGLE(dsyr, DSYR)(const char*, KK_INT*, const double*, + const double*, KK_INT*, double*, KK_INT*); // Although there is a cgeru, there is no csyru // Although there is a zgeru, there is no zsyru // Although there is a cgerc, there is no csyrc, but there is cher (see below) @@ -265,135 +293,166 @@ void F77_BLAS_MANGLE(dsyr, DSYR)(const char*, int*, const double*, /// Her /// -void F77_BLAS_MANGLE(cher, 
CHER)(const char*, int*, const float*, - const std::complex*, int*, - std::complex*, int*); -void F77_BLAS_MANGLE(zher, ZHER)(const char*, int*, const double*, - const std::complex*, int*, - std::complex*, int*); +void F77_BLAS_MANGLE(cher, CHER)(const char*, KK_INT*, const float*, + const std::complex*, KK_INT*, + std::complex*, KK_INT*); +void F77_BLAS_MANGLE(zher, ZHER)(const char*, KK_INT*, const double*, + const std::complex*, KK_INT*, + std::complex*, KK_INT*); + +/// +/// Syr2 +/// +void F77_BLAS_MANGLE(ssyr2, SSYR2)(const char*, KK_INT*, const float*, + const float*, const KK_INT*, const float*, + KK_INT*, float*, KK_INT*); +void F77_BLAS_MANGLE(dsyr2, DSYR2)(const char*, KK_INT*, const double*, + const double*, const KK_INT*, const double*, + KK_INT*, double*, KK_INT*); +// Although there is a cgeru, there is no csyr2u +// Although there is a zgeru, there is no zsyr2u +// Although there is a cgerc, there is no csyr2c, but there is cher2 (see below) +// Although there is a zgerc, there is no zsyr2c, but there is zher2 (see below) + +/// +/// Her2 +/// + +void F77_BLAS_MANGLE(cher2, CHER2)(const char*, KK_INT*, + const std::complex*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, + std::complex*, KK_INT*); +void F77_BLAS_MANGLE(zher2, ZHER2)(const char*, KK_INT*, + const std::complex*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, + std::complex*, KK_INT*); /// /// Trsv /// -void F77_BLAS_MANGLE(strsv, STRSV)(const char*, const char*, const char*, int*, - const float*, int*, - /* */ float*, int*); -void F77_BLAS_MANGLE(dtrsv, DTRSV)(const char*, const char*, const char*, int*, - const double*, int*, - /* */ double*, int*); -void F77_BLAS_MANGLE(ctrsv, CTRSV)(const char*, const char*, const char*, int*, - const std::complex*, int*, - /* */ std::complex*, int*); -void F77_BLAS_MANGLE(ztrsv, ZTRSV)(const char*, const char*, const char*, int*, - const std::complex*, int*, - /* */ std::complex*, int*); +void 
F77_BLAS_MANGLE(strsv, STRSV)(const char*, const char*, const char*, + KK_INT*, const float*, KK_INT*, + /* */ float*, KK_INT*); +void F77_BLAS_MANGLE(dtrsv, DTRSV)(const char*, const char*, const char*, + KK_INT*, const double*, KK_INT*, + /* */ double*, KK_INT*); +void F77_BLAS_MANGLE(ctrsv, CTRSV)(const char*, const char*, const char*, + KK_INT*, const std::complex*, KK_INT*, + /* */ std::complex*, KK_INT*); +void F77_BLAS_MANGLE(ztrsv, ZTRSV)(const char*, const char*, const char*, + KK_INT*, const std::complex*, + KK_INT*, + /* */ std::complex*, KK_INT*); /// /// Gemm /// -void F77_BLAS_MANGLE(sgemm, SGEMM)(const char*, const char*, int*, int*, int*, - const float*, const float*, int*, - const float*, int*, const float*, - /* */ float*, int*); -void F77_BLAS_MANGLE(dgemm, DGEMM)(const char*, const char*, int*, int*, int*, - const double*, const double*, int*, - const double*, int*, const double*, - /* */ double*, int*); -void F77_BLAS_MANGLE(cgemm, CGEMM)(const char*, const char*, int*, int*, int*, - const std::complex*, - const std::complex*, int*, - const std::complex*, int*, +void F77_BLAS_MANGLE(sgemm, SGEMM)(const char*, const char*, KK_INT*, KK_INT*, + KK_INT*, const float*, const float*, KK_INT*, + const float*, KK_INT*, const float*, + /* */ float*, KK_INT*); +void F77_BLAS_MANGLE(dgemm, DGEMM)(const char*, const char*, KK_INT*, KK_INT*, + KK_INT*, const double*, const double*, + KK_INT*, const double*, KK_INT*, + const double*, + /* */ double*, KK_INT*); +void F77_BLAS_MANGLE(cgemm, CGEMM)(const char*, const char*, KK_INT*, KK_INT*, + KK_INT*, const std::complex*, + const std::complex*, KK_INT*, + const std::complex*, KK_INT*, const std::complex*, - /* */ std::complex*, int*); -void F77_BLAS_MANGLE(zgemm, ZGEMM)(const char*, const char*, int*, int*, int*, + /* */ std::complex*, KK_INT*); +void F77_BLAS_MANGLE(zgemm, ZGEMM)(const char*, const char*, KK_INT*, KK_INT*, + KK_INT*, const std::complex*, + const std::complex*, KK_INT*, + const std::complex*, 
KK_INT*, const std::complex*, - const std::complex*, int*, - const std::complex*, int*, - const std::complex*, - /* */ std::complex*, int*); + /* */ std::complex*, KK_INT*); /// /// Herk /// -void F77_BLAS_MANGLE(ssyrk, SSYRK)(const char*, const char*, int*, int*, - const float*, const float*, int*, +void F77_BLAS_MANGLE(ssyrk, SSYRK)(const char*, const char*, KK_INT*, KK_INT*, + const float*, const float*, KK_INT*, const float*, - /* */ float*, int*); -void F77_BLAS_MANGLE(dsyrk, DSYRK)(const char*, const char*, int*, int*, - const double*, const double*, int*, + /* */ float*, KK_INT*); +void F77_BLAS_MANGLE(dsyrk, DSYRK)(const char*, const char*, KK_INT*, KK_INT*, + const double*, const double*, KK_INT*, const double*, - /* */ double*, int*); -void F77_BLAS_MANGLE(cherk, CHERK)(const char*, const char*, int*, int*, + /* */ double*, KK_INT*); +void F77_BLAS_MANGLE(cherk, CHERK)(const char*, const char*, KK_INT*, KK_INT*, const std::complex*, - const std::complex*, int*, + const std::complex*, KK_INT*, const std::complex*, - /* */ std::complex*, int*); -void F77_BLAS_MANGLE(zherk, ZHERK)(const char*, const char*, int*, int*, + /* */ std::complex*, KK_INT*); +void F77_BLAS_MANGLE(zherk, ZHERK)(const char*, const char*, KK_INT*, KK_INT*, const std::complex*, - const std::complex*, int*, + const std::complex*, KK_INT*, const std::complex*, - /* */ std::complex*, int*); + /* */ std::complex*, KK_INT*); /// /// Trmm /// void F77_BLAS_MANGLE(strmm, STRMM)(const char*, const char*, const char*, - const char*, int*, int*, const float*, - const float*, int*, - /* */ float*, int*); + const char*, KK_INT*, KK_INT*, const float*, + const float*, KK_INT*, + /* */ float*, KK_INT*); void F77_BLAS_MANGLE(dtrmm, DTRMM)(const char*, const char*, const char*, - const char*, int*, int*, const double*, - const double*, int*, - /* */ double*, int*); + const char*, KK_INT*, KK_INT*, const double*, + const double*, KK_INT*, + /* */ double*, KK_INT*); void F77_BLAS_MANGLE(ctrmm, 
CTRMM)(const char*, const char*, const char*, - const char*, int*, int*, + const char*, KK_INT*, KK_INT*, const std::complex*, - const std::complex*, int*, - /* */ std::complex*, int*); + const std::complex*, KK_INT*, + /* */ std::complex*, KK_INT*); void F77_BLAS_MANGLE(ztrmm, ZTRMM)(const char*, const char*, const char*, - const char*, int*, int*, + const char*, KK_INT*, KK_INT*, const std::complex*, - const std::complex*, int*, - /* */ std::complex*, int*); + const std::complex*, KK_INT*, + /* */ std::complex*, KK_INT*); /// /// Trsm /// void F77_BLAS_MANGLE(strsm, STRSM)(const char*, const char*, const char*, - const char*, int*, int*, const float*, - const float*, int*, - /* */ float*, int*); + const char*, KK_INT*, KK_INT*, const float*, + const float*, KK_INT*, + /* */ float*, KK_INT*); void F77_BLAS_MANGLE(dtrsm, DTRSM)(const char*, const char*, const char*, - const char*, int*, int*, const double*, - const double*, int*, - /* */ double*, int*); + const char*, KK_INT*, KK_INT*, const double*, + const double*, KK_INT*, + /* */ double*, KK_INT*); void F77_BLAS_MANGLE(ctrsm, CTRSM)(const char*, const char*, const char*, - const char*, int*, int*, + const char*, KK_INT*, KK_INT*, const std::complex*, - const std::complex*, int*, - /* */ std::complex*, int*); + const std::complex*, KK_INT*, + /* */ std::complex*, KK_INT*); void F77_BLAS_MANGLE(ztrsm, ZTRSM)(const char*, const char*, const char*, - const char*, int*, int*, + const char*, KK_INT*, KK_INT*, const std::complex*, - const std::complex*, int*, - /* */ std::complex*, int*); + const std::complex*, KK_INT*, + /* */ std::complex*, KK_INT*); } -void F77_BLAS_MANGLE(sscal, SSCAL)(const int* N, const float* alpha, - /* */ float* x, const int* x_inc); -void F77_BLAS_MANGLE(dscal, DSCAL)(const int* N, const double* alpha, - /* */ double* x, const int* x_inc); +void F77_BLAS_MANGLE(sscal, SSCAL)(const KK_INT* N, const float* alpha, + /* */ float* x, const KK_INT* x_inc); +void F77_BLAS_MANGLE(dscal, DSCAL)(const 
KK_INT* N, const double* alpha, + /* */ double* x, const KK_INT* x_inc); void F77_BLAS_MANGLE(cscal, - CSCAL)(const int* N, const std::complex* alpha, - /* */ std::complex* x, const int* x_inc); + CSCAL)(const KK_INT* N, const std::complex* alpha, + /* */ std::complex* x, const KK_INT* x_inc); void F77_BLAS_MANGLE(zscal, - ZSCAL)(const int* N, const std::complex* alpha, - /* */ std::complex* x, const int* x_inc); + ZSCAL)(const KK_INT* N, const std::complex* alpha, + /* */ std::complex* x, const KK_INT* x_inc); #define F77_FUNC_SSCAL F77_BLAS_MANGLE(sscal, SSCAL) #define F77_FUNC_DSCAL F77_BLAS_MANGLE(dscal, DSCAL) @@ -466,6 +525,12 @@ void F77_BLAS_MANGLE(zscal, #define F77_FUNC_CHER F77_BLAS_MANGLE(cher, CHER) #define F77_FUNC_ZHER F77_BLAS_MANGLE(zher, ZHER) +#define F77_FUNC_SSYR2 F77_BLAS_MANGLE(ssyr2, SSYR2) +#define F77_FUNC_DSYR2 F77_BLAS_MANGLE(dsyr2, DSYR2) + +#define F77_FUNC_CHER2 F77_BLAS_MANGLE(cher2, CHER2) +#define F77_FUNC_ZHER2 F77_BLAS_MANGLE(zher2, ZHER2) + #define F77_FUNC_STRSV F77_BLAS_MANGLE(strsv, STRSV) #define F77_FUNC_DTRSV F77_BLAS_MANGLE(dtrsv, DTRSV) #define F77_FUNC_CTRSV F77_BLAS_MANGLE(ctrsv, CTRSV) @@ -499,35 +564,36 @@ namespace Impl { /// template <> -void HostBlas::scal(int n, const float alpha, - /* */ float* x, int x_inc) { +void HostBlas::scal(KK_INT n, const float alpha, + /* */ float* x, KK_INT x_inc) { F77_FUNC_SSCAL(&n, &alpha, x, &x_inc); } template <> -int HostBlas::iamax(int n, const float* x, int x_inc) { +KK_INT HostBlas::iamax(KK_INT n, const float* x, KK_INT x_inc) { return F77_FUNC_ISAMAX(&n, x, &x_inc); } template <> -float HostBlas::nrm2(int n, const float* x, int x_inc) { +float HostBlas::nrm2(KK_INT n, const float* x, KK_INT x_inc) { return F77_FUNC_SNRM2(&n, x, &x_inc); } template <> -float HostBlas::asum(int n, const float* x, int x_inc) { +float HostBlas::asum(KK_INT n, const float* x, KK_INT x_inc) { return F77_FUNC_SASUM(&n, x, &x_inc); } template <> -float HostBlas::dot(int n, const float* x, int x_inc, 
const float* y, - int y_inc) { +float HostBlas::dot(KK_INT n, const float* x, KK_INT x_inc, + const float* y, KK_INT y_inc) { return F77_FUNC_SDOT(&n, x, &x_inc, y, &y_inc); } template <> -void HostBlas::axpy(int n, const float alpha, const float* x, int x_inc, - /* */ float* y, int y_inc) { +void HostBlas::axpy(KK_INT n, const float alpha, const float* x, + KK_INT x_inc, + /* */ float* y, KK_INT y_inc) { F77_FUNC_SAXPY(&n, &alpha, x, &x_inc, y, &y_inc); } template <> -void HostBlas::rot(int const N, float* X, int const incx, float* Y, - int const incy, float* c, float* s) { +void HostBlas::rot(KK_INT const N, float* X, KK_INT const incx, float* Y, + KK_INT const incy, float* c, float* s) { F77_FUNC_SROT(&N, X, &incx, Y, &incy, c, s); } template <> @@ -535,8 +601,8 @@ void HostBlas::rotg(float* a, float* b, float* c, float* s) { F77_FUNC_SROTG(a, b, c, s); } template <> -void HostBlas::rotm(const int n, float* X, const int incx, float* Y, - const int incy, const float* param) { +void HostBlas::rotm(const KK_INT n, float* X, const KK_INT incx, + float* Y, const KK_INT incy, const float* param) { F77_FUNC_SROTM(&n, X, &incx, Y, &incy, param); } template <> @@ -545,62 +611,69 @@ void HostBlas::rotmg(float* d1, float* d2, float* x1, const float* y1, F77_FUNC_SROTMG(d1, d2, x1, y1, param); } template <> -void HostBlas::swap(int const N, float* X, int const incx, float* Y, - int const incy) { +void HostBlas::swap(KK_INT const N, float* X, KK_INT const incx, + float* Y, KK_INT const incy) { F77_FUNC_SSWAP(&N, X, &incx, Y, &incy); } template <> -void HostBlas::gemv(const char trans, int m, int n, const float alpha, - const float* a, int lda, const float* b, int ldb, - const float beta, - /* */ float* c, int ldc) { +void HostBlas::gemv(const char trans, KK_INT m, KK_INT n, + const float alpha, const float* a, KK_INT lda, + const float* b, KK_INT ldb, const float beta, + /* */ float* c, KK_INT ldc) { F77_FUNC_SGEMV(&trans, &m, &n, &alpha, a, &lda, b, &ldb, &beta, c, &ldc); } 
template <> -void HostBlas::ger(int m, int n, const float alpha, const float* x, - int incx, const float* y, int incy, float* a, - int lda) { +void HostBlas::ger(KK_INT m, KK_INT n, const float alpha, const float* x, + KK_INT incx, const float* y, KK_INT incy, float* a, + KK_INT lda) { F77_FUNC_SGER(&m, &n, &alpha, x, &incx, y, &incy, a, &lda); } template <> -void HostBlas::syr(const char uplo, int n, const float alpha, - const float* x, int incx, float* a, int lda) { +void HostBlas::syr(const char uplo, KK_INT n, const float alpha, + const float* x, KK_INT incx, float* a, KK_INT lda) { F77_FUNC_SSYR(&uplo, &n, &alpha, x, &incx, a, &lda); } template <> +void HostBlas::syr2(const char uplo, KK_INT n, const float alpha, + const float* x, KK_INT incx, const float* y, + KK_INT incy, float* a, KK_INT lda) { + F77_FUNC_SSYR2(&uplo, &n, &alpha, x, &incx, y, &incy, a, &lda); +} +template <> void HostBlas::trsv(const char uplo, const char transa, const char diag, - int m, const float* a, int lda, - /* */ float* b, int ldb) { + KK_INT m, const float* a, KK_INT lda, + /* */ float* b, KK_INT ldb) { F77_FUNC_STRSV(&uplo, &transa, &diag, &m, a, &lda, b, &ldb); } template <> -void HostBlas::gemm(const char transa, const char transb, int m, int n, - int k, const float alpha, const float* a, int lda, - const float* b, int ldb, const float beta, - /* */ float* c, int ldc) { +void HostBlas::gemm(const char transa, const char transb, KK_INT m, + KK_INT n, KK_INT k, const float alpha, + const float* a, KK_INT lda, const float* b, + KK_INT ldb, const float beta, + /* */ float* c, KK_INT ldc) { F77_FUNC_SGEMM(&transa, &transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc); } template <> -void HostBlas::herk(const char transa, const char transb, int n, int k, - const float alpha, const float* a, int lda, - const float beta, - /* */ float* c, int ldc) { +void HostBlas::herk(const char transa, const char transb, KK_INT n, + KK_INT k, const float alpha, const float* a, + KK_INT lda, 
const float beta, + /* */ float* c, KK_INT ldc) { F77_FUNC_SSYRK(&transa, &transb, &n, &k, &alpha, a, &lda, &beta, c, &ldc); } template <> void HostBlas::trmm(const char side, const char uplo, const char transa, - const char diag, int m, int n, const float alpha, - const float* a, int lda, - /* */ float* b, int ldb) { + const char diag, KK_INT m, KK_INT n, + const float alpha, const float* a, KK_INT lda, + /* */ float* b, KK_INT ldb) { F77_FUNC_STRMM(&side, &uplo, &transa, &diag, &m, &n, &alpha, a, &lda, b, &ldb); } template <> void HostBlas::trsm(const char side, const char uplo, const char transa, - const char diag, int m, int n, const float alpha, - const float* a, int lda, - /* */ float* b, int ldb) { + const char diag, KK_INT m, KK_INT n, + const float alpha, const float* a, KK_INT lda, + /* */ float* b, KK_INT ldb) { F77_FUNC_STRSM(&side, &uplo, &transa, &diag, &m, &n, &alpha, a, &lda, b, &ldb); } @@ -610,36 +683,36 @@ void HostBlas::trsm(const char side, const char uplo, const char transa, /// template <> -void HostBlas::scal(int n, const double alpha, - /* */ double* x, int x_inc) { +void HostBlas::scal(KK_INT n, const double alpha, + /* */ double* x, KK_INT x_inc) { F77_FUNC_DSCAL(&n, &alpha, x, &x_inc); } template <> -int HostBlas::iamax(int n, const double* x, int x_inc) { +KK_INT HostBlas::iamax(KK_INT n, const double* x, KK_INT x_inc) { return F77_FUNC_IDAMAX(&n, x, &x_inc); } template <> -double HostBlas::nrm2(int n, const double* x, int x_inc) { +double HostBlas::nrm2(KK_INT n, const double* x, KK_INT x_inc) { return F77_FUNC_DNRM2(&n, x, &x_inc); } template <> -double HostBlas::asum(int n, const double* x, int x_inc) { +double HostBlas::asum(KK_INT n, const double* x, KK_INT x_inc) { return F77_FUNC_DASUM(&n, x, &x_inc); } template <> -double HostBlas::dot(int n, const double* x, int x_inc, const double* y, - int y_inc) { +double HostBlas::dot(KK_INT n, const double* x, KK_INT x_inc, + const double* y, KK_INT y_inc) { return F77_FUNC_DDOT(&n, x, 
&x_inc, y, &y_inc); } template <> -void HostBlas::axpy(int n, const double alpha, const double* x, - int x_inc, - /* */ double* y, int y_inc) { +void HostBlas::axpy(KK_INT n, const double alpha, const double* x, + KK_INT x_inc, + /* */ double* y, KK_INT y_inc) { F77_FUNC_DAXPY(&n, &alpha, x, &x_inc, y, &y_inc); } template <> -void HostBlas::rot(int const N, double* X, int const incx, double* Y, - int const incy, double* c, double* s) { +void HostBlas::rot(KK_INT const N, double* X, KK_INT const incx, + double* Y, KK_INT const incy, double* c, double* s) { F77_FUNC_DROT(&N, X, &incx, Y, &incy, c, s); } template <> @@ -647,8 +720,8 @@ void HostBlas::rotg(double* a, double* b, double* c, double* s) { F77_FUNC_DROTG(a, b, c, s); } template <> -void HostBlas::rotm(const int n, double* X, const int incx, double* Y, - const int incy, const double* param) { +void HostBlas::rotm(const KK_INT n, double* X, const KK_INT incx, + double* Y, const KK_INT incy, const double* param) { F77_FUNC_DROTM(&n, X, &incx, Y, &incy, param); } template <> @@ -657,62 +730,70 @@ void HostBlas::rotmg(double* d1, double* d2, double* x1, F77_FUNC_DROTMG(d1, d2, x1, y1, param); } template <> -void HostBlas::swap(int const N, double* X, int const incx, double* Y, - int const incy) { +void HostBlas::swap(KK_INT const N, double* X, KK_INT const incx, + double* Y, KK_INT const incy) { F77_FUNC_DSWAP(&N, X, &incx, Y, &incy); } template <> -void HostBlas::gemv(const char trans, int m, int n, const double alpha, - const double* a, int lda, const double* b, int ldb, - const double beta, - /* */ double* c, int ldc) { +void HostBlas::gemv(const char trans, KK_INT m, KK_INT n, + const double alpha, const double* a, KK_INT lda, + const double* b, KK_INT ldb, const double beta, + /* */ double* c, KK_INT ldc) { F77_FUNC_DGEMV(&trans, &m, &n, &alpha, a, &lda, b, &ldb, &beta, c, &ldc); } template <> -void HostBlas::ger(int m, int n, const double alpha, const double* x, - int incx, const double* y, int incy, 
double* a, - int lda) { +void HostBlas::ger(KK_INT m, KK_INT n, const double alpha, + const double* x, KK_INT incx, const double* y, + KK_INT incy, double* a, KK_INT lda) { F77_FUNC_DGER(&m, &n, &alpha, x, &incx, y, &incy, a, &lda); } template <> -void HostBlas::syr(const char uplo, int n, const double alpha, - const double* x, int incx, double* a, int lda) { +void HostBlas::syr(const char uplo, KK_INT n, const double alpha, + const double* x, KK_INT incx, double* a, + KK_INT lda) { F77_FUNC_DSYR(&uplo, &n, &alpha, x, &incx, a, &lda); } template <> +void HostBlas::syr2(const char uplo, KK_INT n, const double alpha, + const double* x, KK_INT incx, const double* y, + KK_INT incy, double* a, KK_INT lda) { + F77_FUNC_DSYR2(&uplo, &n, &alpha, x, &incx, y, &incy, a, &lda); +} +template <> void HostBlas::trsv(const char uplo, const char transa, const char diag, - int m, const double* a, int lda, - /* */ double* b, int ldb) { + KK_INT m, const double* a, KK_INT lda, + /* */ double* b, KK_INT ldb) { F77_FUNC_DTRSV(&uplo, &transa, &diag, &m, a, &lda, b, &ldb); } template <> -void HostBlas::gemm(const char transa, const char transb, int m, int n, - int k, const double alpha, const double* a, int lda, - const double* b, int ldb, const double beta, - /* */ double* c, int ldc) { +void HostBlas::gemm(const char transa, const char transb, KK_INT m, + KK_INT n, KK_INT k, const double alpha, + const double* a, KK_INT lda, const double* b, + KK_INT ldb, const double beta, + /* */ double* c, KK_INT ldc) { F77_FUNC_DGEMM(&transa, &transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc); } template <> -void HostBlas::herk(const char transa, const char transb, int n, int k, - const double alpha, const double* a, int lda, - const double beta, - /* */ double* c, int ldc) { +void HostBlas::herk(const char transa, const char transb, KK_INT n, + KK_INT k, const double alpha, const double* a, + KK_INT lda, const double beta, + /* */ double* c, KK_INT ldc) { F77_FUNC_DSYRK(&transa, 
&transb, &n, &k, &alpha, a, &lda, &beta, c, &ldc); } template <> void HostBlas::trmm(const char side, const char uplo, const char transa, - const char diag, int m, int n, const double alpha, - const double* a, int lda, - /* */ double* b, int ldb) { + const char diag, KK_INT m, KK_INT n, + const double alpha, const double* a, KK_INT lda, + /* */ double* b, KK_INT ldb) { F77_FUNC_DTRMM(&side, &uplo, &transa, &diag, &m, &n, &alpha, a, &lda, b, &ldb); } template <> void HostBlas::trsm(const char side, const char uplo, const char transa, - const char diag, int m, int n, const double alpha, - const double* a, int lda, - /* */ double* b, int ldb) { + const char diag, KK_INT m, KK_INT n, + const double alpha, const double* a, KK_INT lda, + /* */ double* b, KK_INT ldb) { F77_FUNC_DTRSM(&side, &uplo, &transa, &diag, &m, &n, &alpha, a, &lda, b, &ldb); } @@ -722,33 +803,37 @@ void HostBlas::trsm(const char side, const char uplo, const char transa, /// template <> -void HostBlas >::scal(int n, +void HostBlas >::scal(KK_INT n, const std::complex alpha, /* */ std::complex* x, - int x_inc) { + KK_INT x_inc) { F77_FUNC_CSCAL(&n, &alpha, x, &x_inc); } template <> -int HostBlas >::iamax(int n, const std::complex* x, - int x_inc) { +KK_INT HostBlas >::iamax(KK_INT n, + const std::complex* x, + KK_INT x_inc) { return F77_FUNC_ICAMAX(&n, x, &x_inc); } template <> -float HostBlas >::nrm2(int n, const std::complex* x, - int x_inc) { +float HostBlas >::nrm2(KK_INT n, + const std::complex* x, + KK_INT x_inc) { return F77_FUNC_SCNRM2(&n, x, &x_inc); } template <> -float HostBlas >::asum(int n, const std::complex* x, - int x_inc) { +float HostBlas >::asum(KK_INT n, + const std::complex* x, + KK_INT x_inc) { return F77_FUNC_SCASUM(&n, x, &x_inc); } template <> std::complex HostBlas >::dot( - int n, const std::complex* x, int x_inc, - const std::complex* y, int y_inc) { + KK_INT n, const std::complex* x, KK_INT x_inc, + const std::complex* y, KK_INT y_inc) { #if 
defined(KOKKOSKERNELS_TPL_BLAS_RETURN_COMPLEX) - return F77_FUNC_CDOTC(&n, x, &x_inc, y, &y_inc); + _kk_float2 res = F77_FUNC_CDOTC(&n, x, &x_inc, y, &y_inc); + return std::complex(res.vals[0], res.vals[1]); #else std::complex res; F77_FUNC_CDOTC(&res, &n, x, &x_inc, y, &y_inc); @@ -756,18 +841,20 @@ std::complex HostBlas >::dot( #endif } template <> -void HostBlas >::axpy(int n, +void HostBlas >::axpy(KK_INT n, const std::complex alpha, const std::complex* x, - int x_inc, + KK_INT x_inc, /* */ std::complex* y, - int y_inc) { + KK_INT y_inc) { F77_FUNC_CAXPY(&n, &alpha, x, &x_inc, y, &y_inc); } template <> -void HostBlas >::rot(int const N, std::complex* X, - int const incx, std::complex* Y, - int const incy, float* c, float* s) { +void HostBlas >::rot(KK_INT const N, std::complex* X, + KK_INT const incx, + std::complex* Y, + KK_INT const incy, float* c, + float* s) { F77_FUNC_CROT(&N, X, &incx, Y, &incy, c, s); } template <> @@ -777,38 +864,37 @@ void HostBlas >::rotg(std::complex* a, F77_FUNC_CROTG(a, b, c, s); } template <> -void HostBlas >::swap(int const N, std::complex* X, - int const incx, +void HostBlas >::swap(KK_INT const N, + std::complex* X, + KK_INT const incx, std::complex* Y, - int const incy) { + KK_INT const incy) { F77_FUNC_CSWAP(&N, X, &incx, Y, &incy); } template <> -void HostBlas >::gemv(const char trans, int m, int n, - const std::complex alpha, - const std::complex* a, int lda, - const std::complex* b, int ldb, - const std::complex beta, - /* */ std::complex* c, - int ldc) { +void HostBlas >::gemv( + const char trans, KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* a, KK_INT lda, const std::complex* b, + KK_INT ldb, const std::complex beta, + /* */ std::complex* c, KK_INT ldc) { F77_FUNC_CGEMV(&trans, &m, &n, &alpha, (const std::complex*)a, &lda, (const std::complex*)b, &ldb, &beta, (std::complex*)c, &ldc); } template <> void HostBlas >::geru( - int m, int n, const std::complex alpha, const std::complex* x, - int incx, 
const std::complex* y, int incy, std::complex* a, - int lda) { + KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* x, KK_INT incx, const std::complex* y, + KK_INT incy, std::complex* a, KK_INT lda) { F77_FUNC_CGERU(&m, &n, &alpha, (const std::complex*)x, &incx, (const std::complex*)y, &incy, (std::complex*)a, &lda); } template <> void HostBlas >::gerc( - int m, int n, const std::complex alpha, const std::complex* x, - int incx, const std::complex* y, int incy, std::complex* a, - int lda) { + KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* x, KK_INT incx, const std::complex* y, + KK_INT incy, std::complex* a, KK_INT lda) { F77_FUNC_CGERC(&m, &n, &alpha, (const std::complex*)x, &incx, (const std::complex*)y, &incy, (std::complex*)a, &lda); @@ -816,63 +902,67 @@ void HostBlas >::gerc( template <> template <> void HostBlas >::cher( - const char uplo, int n, const float alpha, const std::complex* x, - int incx, std::complex* a, int lda) { + const char uplo, KK_INT n, const float alpha, const std::complex* x, + KK_INT incx, std::complex* a, KK_INT lda) { F77_FUNC_CHER(&uplo, &n, &alpha, (const std::complex*)x, &incx, (std::complex*)a, &lda); } template <> +void HostBlas >::cher2( + const char uplo, KK_INT n, const std::complex alpha, + const std::complex* x, KK_INT incx, const std::complex* y, + KK_INT incy, std::complex* a, KK_INT lda) { + F77_FUNC_CHER2(&uplo, &n, &alpha, (const std::complex*)x, &incx, + (const std::complex*)y, &incy, (std::complex*)a, + &lda); +} +template <> void HostBlas >::trsv(const char uplo, const char transa, - const char diag, int m, - const std::complex* a, int lda, + const char diag, KK_INT m, + const std::complex* a, + KK_INT lda, /* */ std::complex* b, - int ldb) { + KK_INT ldb) { F77_FUNC_CTRSV(&uplo, &transa, &diag, &m, (const std::complex*)a, &lda, (std::complex*)b, &ldb); } template <> void HostBlas >::gemm( - const char transa, const char transb, int m, int n, int k, - const std::complex alpha, 
const std::complex* a, int lda, - const std::complex* b, int ldb, const std::complex beta, - /* */ std::complex* c, int ldc) { + const char transa, const char transb, KK_INT m, KK_INT n, KK_INT k, + const std::complex alpha, const std::complex* a, KK_INT lda, + const std::complex* b, KK_INT ldb, const std::complex beta, + /* */ std::complex* c, KK_INT ldc) { F77_FUNC_CGEMM(&transa, &transb, &m, &n, &k, &alpha, (const std::complex*)a, &lda, (const std::complex*)b, &ldb, &beta, (std::complex*)c, &ldc); } template <> -void HostBlas >::herk(const char transa, const char transb, - int n, int k, - const std::complex alpha, - const std::complex* a, int lda, - const std::complex beta, - /* */ std::complex* c, - int ldc) { +void HostBlas >::herk( + const char transa, const char transb, KK_INT n, KK_INT k, + const std::complex alpha, const std::complex* a, KK_INT lda, + const std::complex beta, + /* */ std::complex* c, KK_INT ldc) { F77_FUNC_CHERK(&transa, &transb, &n, &k, &alpha, (const std::complex*)a, &lda, &beta, (std::complex*)c, &ldc); } template <> -void HostBlas >::trmm(const char side, const char uplo, - const char transa, const char diag, - int m, int n, - const std::complex alpha, - const std::complex* a, int lda, - /* */ std::complex* b, - int ldb) { +void HostBlas >::trmm( + const char side, const char uplo, const char transa, const char diag, + KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* a, KK_INT lda, + /* */ std::complex* b, KK_INT ldb) { F77_FUNC_CTRMM(&side, &uplo, &transa, &diag, &m, &n, &alpha, (const std::complex*)a, &lda, (std::complex*)b, &ldb); } template <> -void HostBlas >::trsm(const char side, const char uplo, - const char transa, const char diag, - int m, int n, - const std::complex alpha, - const std::complex* a, int lda, - /* */ std::complex* b, - int ldb) { +void HostBlas >::trsm( + const char side, const char uplo, const char transa, const char diag, + KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* 
a, KK_INT lda, + /* */ std::complex* b, KK_INT ldb) { F77_FUNC_CTRSM(&side, &uplo, &transa, &diag, &m, &n, &alpha, (const std::complex*)a, &lda, (std::complex*)b, &ldb); @@ -883,35 +973,37 @@ void HostBlas >::trsm(const char side, const char uplo, /// template <> -void HostBlas >::scal(int n, +void HostBlas >::scal(KK_INT n, const std::complex alpha, /* */ std::complex* x, - int x_inc) { + KK_INT x_inc) { F77_FUNC_ZSCAL(&n, &alpha, x, &x_inc); } template <> -int HostBlas >::iamax(int n, const std::complex* x, - int x_inc) { +KK_INT HostBlas >::iamax(KK_INT n, + const std::complex* x, + KK_INT x_inc) { return F77_FUNC_IZAMAX(&n, x, &x_inc); } template <> -double HostBlas >::nrm2(int n, +double HostBlas >::nrm2(KK_INT n, const std::complex* x, - int x_inc) { + KK_INT x_inc) { return F77_FUNC_DZNRM2(&n, x, &x_inc); } template <> -double HostBlas >::asum(int n, +double HostBlas >::asum(KK_INT n, const std::complex* x, - int x_inc) { + KK_INT x_inc) { return F77_FUNC_DZASUM(&n, x, &x_inc); } template <> std::complex HostBlas >::dot( - int n, const std::complex* x, int x_inc, - const std::complex* y, int y_inc) { + KK_INT n, const std::complex* x, KK_INT x_inc, + const std::complex* y, KK_INT y_inc) { #if defined(KOKKOSKERNELS_TPL_BLAS_RETURN_COMPLEX) - return F77_FUNC_ZDOTC(&n, x, &x_inc, y, &y_inc); + _kk_double2 res = F77_FUNC_ZDOTC(&n, x, &x_inc, y, &y_inc); + return std::complex(res.vals[0], res.vals[1]); #else std::complex res; F77_FUNC_ZDOTC(&res, &n, x, &x_inc, y, &y_inc); @@ -919,20 +1011,18 @@ std::complex HostBlas >::dot( #endif } template <> -void HostBlas >::axpy(int n, +void HostBlas >::axpy(KK_INT n, const std::complex alpha, const std::complex* x, - int x_inc, + KK_INT x_inc, /* */ std::complex* y, - int y_inc) { + KK_INT y_inc) { F77_FUNC_ZAXPY(&n, &alpha, x, &x_inc, y, &y_inc); } template <> -void HostBlas >::rot(int const N, std::complex* X, - int const incx, - std::complex* Y, - int const incy, double* c, - double* s) { +void HostBlas >::rot( + KK_INT 
const N, std::complex* X, KK_INT const incx, + std::complex* Y, KK_INT const incy, double* c, double* s) { F77_FUNC_ZROT(&N, X, &incx, Y, &incy, c, s); } template <> @@ -942,36 +1032,37 @@ void HostBlas >::rotg(std::complex* a, F77_FUNC_ZROTG(a, b, c, s); } template <> -void HostBlas >::swap(int const N, std::complex* X, - int const incx, +void HostBlas >::swap(KK_INT const N, + std::complex* X, + KK_INT const incx, std::complex* Y, - int const incy) { + KK_INT const incy) { F77_FUNC_ZSWAP(&N, X, &incx, Y, &incy); } template <> void HostBlas >::gemv( - const char trans, int m, int n, const std::complex alpha, - const std::complex* a, int lda, const std::complex* b, - int ldb, const std::complex beta, - /* */ std::complex* c, int ldc) { + const char trans, KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* a, KK_INT lda, const std::complex* b, + KK_INT ldb, const std::complex beta, + /* */ std::complex* c, KK_INT ldc) { F77_FUNC_ZGEMV(&trans, &m, &n, &alpha, (const std::complex*)a, &lda, (const std::complex*)b, &ldb, &beta, (std::complex*)c, &ldc); } template <> void HostBlas >::geru( - int m, int n, const std::complex alpha, - const std::complex* x, int incx, const std::complex* y, - int incy, std::complex* a, int lda) { + KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* x, KK_INT incx, const std::complex* y, + KK_INT incy, std::complex* a, KK_INT lda) { F77_FUNC_ZGERU(&m, &n, &alpha, (const std::complex*)x, &incx, (const std::complex*)y, &incy, (std::complex*)a, &lda); } template <> void HostBlas >::gerc( - int m, int n, const std::complex alpha, - const std::complex* x, int incx, const std::complex* y, - int incy, std::complex* a, int lda) { + KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* x, KK_INT incx, const std::complex* y, + KK_INT incy, std::complex* a, KK_INT lda) { F77_FUNC_ZGERC(&m, &n, &alpha, (const std::complex*)x, &incx, (const std::complex*)y, &incy, (std::complex*)a, &lda); @@ -979,28 
+1070,38 @@ void HostBlas >::gerc( template <> template <> void HostBlas >::zher( - const char uplo, int n, const double alpha, const std::complex* x, - int incx, std::complex* a, int lda) { + const char uplo, KK_INT n, const double alpha, + const std::complex* x, KK_INT incx, std::complex* a, + KK_INT lda) { F77_FUNC_ZHER(&uplo, &n, &alpha, (const std::complex*)x, &incx, (std::complex*)a, &lda); } template <> +void HostBlas >::zher2( + const char uplo, KK_INT n, const std::complex alpha, + const std::complex* x, KK_INT incx, const std::complex* y, + KK_INT incy, std::complex* a, KK_INT lda) { + F77_FUNC_ZHER2(&uplo, &n, &alpha, (const std::complex*)x, &incx, + (const std::complex*)y, &incy, + (std::complex*)a, &lda); +} +template <> void HostBlas >::trsv(const char uplo, const char transa, - const char diag, int m, + const char diag, KK_INT m, const std::complex* a, - int lda, + KK_INT lda, /* */ std::complex* b, - int ldb) { + KK_INT ldb) { F77_FUNC_ZTRSV(&uplo, &transa, &diag, &m, (const std::complex*)a, &lda, (std::complex*)b, &ldb); } template <> void HostBlas >::gemm( - const char transa, const char transb, int m, int n, int k, - const std::complex alpha, const std::complex* a, int lda, - const std::complex* b, int ldb, const std::complex beta, - /* */ std::complex* c, int ldc) { + const char transa, const char transb, KK_INT m, KK_INT n, KK_INT k, + const std::complex alpha, const std::complex* a, KK_INT lda, + const std::complex* b, KK_INT ldb, const std::complex beta, + /* */ std::complex* c, KK_INT ldc) { F77_FUNC_ZGEMM(&transa, &transb, &m, &n, &k, &alpha, (const std::complex*)a, &lda, (const std::complex*)b, &ldb, &beta, @@ -1008,30 +1109,30 @@ void HostBlas >::gemm( } template <> void HostBlas >::herk( - const char transa, const char transb, int n, int k, - const std::complex alpha, const std::complex* a, int lda, + const char transa, const char transb, KK_INT n, KK_INT k, + const std::complex alpha, const std::complex* a, KK_INT lda, const 
std::complex beta, - /* */ std::complex* c, int ldc) { + /* */ std::complex* c, KK_INT ldc) { F77_FUNC_ZHERK(&transa, &transb, &n, &k, &alpha, (const std::complex*)a, &lda, &beta, (std::complex*)c, &ldc); } template <> void HostBlas >::trmm( - const char side, const char uplo, const char transa, const char diag, int m, - int n, const std::complex alpha, const std::complex* a, - int lda, - /* */ std::complex* b, int ldb) { + const char side, const char uplo, const char transa, const char diag, + KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* a, KK_INT lda, + /* */ std::complex* b, KK_INT ldb) { F77_FUNC_ZTRMM(&side, &uplo, &transa, &diag, &m, &n, &alpha, (const std::complex*)a, &lda, (std::complex*)b, &ldb); } template <> void HostBlas >::trsm( - const char side, const char uplo, const char transa, const char diag, int m, - int n, const std::complex alpha, const std::complex* a, - int lda, - /* */ std::complex* b, int ldb) { + const char side, const char uplo, const char transa, const char diag, + KK_INT m, KK_INT n, const std::complex alpha, + const std::complex* a, KK_INT lda, + /* */ std::complex* b, KK_INT ldb) { F77_FUNC_ZTRSM(&side, &uplo, &transa, &diag, &m, &n, &alpha, (const std::complex*)a, &lda, (std::complex*)b, &ldb); diff --git a/blas/tpls/KokkosBlas_Host_tpl.hpp b/blas/tpls/KokkosBlas_Host_tpl.hpp index 06a5620155..5fb7c1f624 100644 --- a/blas/tpls/KokkosBlas_Host_tpl.hpp +++ b/blas/tpls/KokkosBlas_Host_tpl.hpp @@ -25,87 +25,106 @@ #include "Kokkos_ArithTraits.hpp" #if defined(KOKKOSKERNELS_ENABLE_TPL_BLAS) +#if defined(KOKKOSKERNELS_ENABLE_TPL_MKL) +#include "mkl_types.h" +#endif namespace KokkosBlas { namespace Impl { +#if defined(KOKKOSKERNELS_ENABLE_TPL_MKL) +using KK_INT = MKL_INT; +#else +using KK_INT = int; +#endif + template struct HostBlas { typedef Kokkos::ArithTraits ats; typedef typename ats::mag_type mag_type; - static void scal(int n, const T alpha, - /* */ T *x, int x_inc); + static void scal(KK_INT n, const T 
alpha, + /* */ T *x, KK_INT x_inc); - static int iamax(int n, const T *x, int x_inc); + static KK_INT iamax(KK_INT n, const T *x, KK_INT x_inc); - static mag_type nrm2(int n, const T *x, int x_inc); + static mag_type nrm2(KK_INT n, const T *x, KK_INT x_inc); - static mag_type asum(int n, const T *x, int x_inc); + static mag_type asum(KK_INT n, const T *x, KK_INT x_inc); - static T dot(int n, const T *x, int x_inc, const T *y, int y_inc); + static T dot(KK_INT n, const T *x, KK_INT x_inc, const T *y, KK_INT y_inc); - static void axpy(int n, const T alpha, const T *x, int x_inc, - /* */ T *y, int y_inc); + static void axpy(KK_INT n, const T alpha, const T *x, KK_INT x_inc, + /* */ T *y, KK_INT y_inc); - static void rot(int const N, T *X, int const incx, T *Y, int const incy, - mag_type *c, mag_type *s); + static void rot(KK_INT const N, T *X, KK_INT const incx, T *Y, + KK_INT const incy, mag_type *c, mag_type *s); static void rotg(T *a, T *b, mag_type *c, T *s); - static void rotm(const int n, T *X, const int incx, T *Y, const int incy, - T const *param); + static void rotm(const KK_INT n, T *X, const KK_INT incx, T *Y, + const KK_INT incy, T const *param); static void rotmg(T *d1, T *d2, T *x1, const T *y1, T *param); - static void swap(int const N, T *X, int const incx, T *Y, int const incy); + static void swap(KK_INT const N, T *X, KK_INT const incx, T *Y, + KK_INT const incy); + + static void gemv(const char trans, KK_INT m, KK_INT n, const T alpha, + const T *a, KK_INT lda, const T *b, KK_INT ldb, const T beta, + /* */ T *c, KK_INT ldc); - static void gemv(const char trans, int m, int n, const T alpha, const T *a, - int lda, const T *b, int ldb, const T beta, - /* */ T *c, int ldc); + static void ger(KK_INT m, KK_INT n, const T alpha, const T *x, KK_INT incx, + const T *y, KK_INT incy, T *a, KK_INT lda); - static void ger(int m, int n, const T alpha, const T *x, int incx, const T *y, - int incy, T *a, int lda); + static void geru(KK_INT m, KK_INT n, const T 
alpha, const T *x, KK_INT incx, + const T *y, KK_INT incy, T *a, KK_INT lda); - static void geru(int m, int n, const T alpha, const T *x, int incx, - const T *y, int incy, T *a, int lda); + static void gerc(KK_INT m, KK_INT n, const T alpha, const T *x, KK_INT incx, + const T *y, KK_INT incy, T *a, KK_INT lda); - static void gerc(int m, int n, const T alpha, const T *x, int incx, - const T *y, int incy, T *a, int lda); + static void syr(const char uplo, KK_INT n, const T alpha, const T *x, + KK_INT incx, T *a, KK_INT lda); - static void syr(const char uplo, int n, const T alpha, const T *x, int incx, - T *a, int lda); + static void syr2(const char uplo, KK_INT n, const T alpha, const T *x, + KK_INT incx, const T *y, KK_INT incy, T *a, KK_INT lda); template - static void cher(const char uplo, int n, const tAlpha alpha, const T *x, - int incx, T *a, int lda); + static void cher(const char uplo, KK_INT n, const tAlpha alpha, const T *x, + KK_INT incx, T *a, KK_INT lda); template - static void zher(const char uplo, int n, const tAlpha alpha, const T *x, - int incx, T *a, int lda); + static void zher(const char uplo, KK_INT n, const tAlpha alpha, const T *x, + KK_INT incx, T *a, KK_INT lda); + + static void cher2(const char uplo, KK_INT n, const T alpha, const T *x, + KK_INT incx, const T *y, KK_INT incy, T *a, KK_INT lda); + + static void zher2(const char uplo, KK_INT n, const T alpha, const T *x, + KK_INT incx, const T *y, KK_INT incy, T *a, KK_INT lda); - static void trsv(const char uplo, const char transa, const char diag, int m, - const T *a, int lda, - /* */ T *b, int ldb); + static void trsv(const char uplo, const char transa, const char diag, + KK_INT m, const T *a, KK_INT lda, + /* */ T *b, KK_INT ldb); - static void gemm(const char transa, const char transb, int m, int n, int k, - const T alpha, const T *a, int lda, const T *b, int ldb, - const T beta, - /* */ T *c, int ldc); + static void gemm(const char transa, const char transb, KK_INT m, KK_INT n, + KK_INT 
k, const T alpha, const T *a, KK_INT lda, const T *b, + KK_INT ldb, const T beta, + /* */ T *c, KK_INT ldc); - static void herk(const char transa, const char transb, int n, int k, - const T alpha, const T *a, int lda, const T beta, - /* */ T *c, int ldc); + static void herk(const char transa, const char transb, KK_INT n, KK_INT k, + const T alpha, const T *a, KK_INT lda, const T beta, + /* */ T *c, KK_INT ldc); static void trmm(const char side, const char uplo, const char transa, - const char diag, int m, int n, const T alpha, const T *a, - int lda, - /* */ T *b, int ldb); + const char diag, KK_INT m, KK_INT n, const T alpha, + const T *a, KK_INT lda, + /* */ T *b, KK_INT ldb); static void trsm(const char side, const char uplo, const char transa, - const char diag, int m, int n, const T alpha, const T *a, - int lda, - /* */ T *b, int ldb); + const char diag, KK_INT m, KK_INT n, const T alpha, + const T *a, KK_INT lda, + /* */ T *b, KK_INT ldb); }; } // namespace Impl } // namespace KokkosBlas diff --git a/blas/unit_test/Test_Blas.hpp b/blas/unit_test/Test_Blas.hpp index a29c5ffd72..9bb37d8d95 100644 --- a/blas/unit_test/Test_Blas.hpp +++ b/blas/unit_test/Test_Blas.hpp @@ -21,6 +21,7 @@ #include "Test_Blas1_asum.hpp" #include "Test_Blas1_axpby.hpp" #include "Test_Blas1_axpy.hpp" +#include "Test_Blas1_axpby_unification.hpp" #include "Test_Blas1_dot.hpp" #include "Test_Blas1_iamax.hpp" #include "Test_Blas1_mult.hpp" @@ -60,6 +61,7 @@ #include "Test_Blas2_gemv.hpp" #include "Test_Blas2_ger.hpp" #include "Test_Blas2_syr.hpp" +#include "Test_Blas2_syr2.hpp" // Serial Blas 2 #include "Test_Blas2_serial_gemv.hpp" diff --git a/blas/unit_test/Test_Blas1_axpby.hpp b/blas/unit_test/Test_Blas1_axpby.hpp index 8d5afb5f0b..299e18e493 100644 --- a/blas/unit_test/Test_Blas1_axpby.hpp +++ b/blas/unit_test/Test_Blas1_axpby.hpp @@ -109,8 +109,6 @@ void impl_test_axpby_mv(int N, int K) { Kokkos::deep_copy(org_y.h_base, y.d_base); Kokkos::deep_copy(x.h_base, x.d_base); - Kokkos::View 
r("Dot::Result", K); - KokkosBlas::axpby(a, x.d_view, b, y.d_view); Kokkos::deep_copy(y.h_base, y.d_base); diff --git a/blas/unit_test/Test_Blas1_axpby_unification.hpp b/blas/unit_test/Test_Blas1_axpby_unification.hpp new file mode 100644 index 0000000000..6ce7bad0b1 --- /dev/null +++ b/blas/unit_test/Test_Blas1_axpby_unification.hpp @@ -0,0 +1,2741 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +// ********************************************************************** +// The tests executed by the code below cover many combinations for +// the operation y = a * x + b * y: +// 01) Type of 'x' and 'a' components: float, double, complex, ... +// 02) Type of 'y' and 'b' components: float, double, complex, ... +// 03) Execution space: serial, threads, OpenMP, Cuda, ... +// 04) Layout of 'x' and 'a' +// 05) Layout of 'y' and 'b' +// 06) Ranks of 'x' and 'y': rank-1 or rank-2 +// 07) Ranks of 'a' and 'b': scalars or rank-0 or rank-1 +// +// Choices (01)-(03) are selected in the routines TEST_F() at the very +// bottom of the file, when calling: +// - either test_axpby_unification<...>(), +// - or test_axpby_mv_unification<...>(). +// +// Choices (04)-(05) are selected in routines: +// - test_axpby_unification<...>(), when calling +// Test::impl_test_axpby_unification<...>(), and +// - test_axpby_mv_unification<...>(), when calling +// Test::impl_test_axpby_mv_unification<...>().
+// +// Choices (06)-(07) are selected in routines: +// - Test::impl_test_axpby_unification<...>(), through +// 16 different combinations and calls to +// Test::impl_test_axpby_unification_compare<...>(), and +// - Test::impl_test_axpby_mv_unification<...>(), through +// 36 different combinations and calls to +// Test::impl_test_axpby_mv_unification_compare<...>(). +// +// The constexpr integer value 15 for 'numVecsAxpbyTest' was chosen to +// force the test of the three unrolling values 8, 4, and 1, in routine +// Axpby_MV_Invoke_Left<...>(...) in file KokkosBlas1_axpby_mv_impl.hpp +// ********************************************************************** + +#include +#include +#include +#include + +static constexpr int numVecsAxpbyTest = 15; + +namespace Test { + +template +struct getScalarTypeFromT { + using type = T; +}; + +template +struct getScalarTypeFromT { + using type = typename T::value_type; +}; + +template +constexpr bool isRank0() { + if constexpr (Kokkos::is_view_v) { + return (T::rank == 0); + } + return false; +} + +template +void impl_test_axpby_unification_compare( + tA const& a, tX const& x, tB const& b, tY const& y, int N, + bool testWithNanY, + typename Kokkos::ArithTraits::mag_type const max_val, + typename Kokkos::ArithTraits::mag_type const max_error, + tScalarA const inputValueA = Kokkos::ArithTraits::zero(), + tScalarB const inputValueB = Kokkos::ArithTraits::zero()) { + using ScalarTypeX = + typename std::remove_const::type; + using ScalarTypeY = + typename std::remove_const::type; + + Kokkos::Random_XorShift64_Pool rand_pool( + 13718); + + { + ScalarTypeX randStart, randEnd; + Test::getRandomBounds(max_val, randStart, randEnd); + Kokkos::fill_random(x.d_view, rand_pool, randStart, randEnd); + } + Kokkos::deep_copy(x.h_base, x.d_base); + + { + ScalarTypeY randStart, randEnd; + Test::getRandomBounds(max_val, randStart, randEnd); + if (testWithNanY) { + Kokkos::deep_copy(y.d_view, Kokkos::ArithTraits::nan()); + } else { + 
Kokkos::fill_random(y.d_view, rand_pool, randStart, randEnd); + } + } + tY org_y("Org_Y", N); + Kokkos::deep_copy(org_y.h_base, y.d_base); + + tScalarA valueA(Kokkos::ArithTraits::zero()); + tScalarB valueB(Kokkos::ArithTraits::zero()); + + if constexpr (std::is_same_v) { + valueA = a; + if constexpr (std::is_same_v) { + valueB = b; + KokkosBlas::axpby(a, x.d_view, b, y.d_view); + } else if constexpr (isRank0()) { + if constexpr (std::is_same_v) { + valueB = inputValueB; + } else { + typename tB::HostMirror h_b("h_B"); + Kokkos::deep_copy(h_b, b); + valueB = h_b(); + } + KokkosBlas::axpby(a, x.d_view, b, y.d_view); + } else { + Kokkos::deep_copy(b.h_base, b.d_base); + valueB = b.h_view(0); + KokkosBlas::axpby(a, x.d_view, b.d_view, y.d_view); + } + } else if constexpr (isRank0()) { + if constexpr (std::is_same_v) { + valueA = inputValueA; + } else { + typename tA::HostMirror h_a("h_A"); + Kokkos::deep_copy(h_a, a); + valueA = h_a(); + } + if constexpr (std::is_same_v) { + valueB = b; + KokkosBlas::axpby(a, x.d_view, b, y.d_view); + } else if constexpr (isRank0()) { + if constexpr (std::is_same_v) { + valueB = inputValueB; + } else { + typename tB::HostMirror h_b("h_B"); + Kokkos::deep_copy(h_b, b); + valueB = h_b(); + } + KokkosBlas::axpby(a, x.d_view, b, y.d_view); + } else { + Kokkos::deep_copy(b.h_base, b.d_base); + valueB = b.h_view(0); + KokkosBlas::axpby(a, x.d_view, b.d_view, y.d_view); + } + } else { + Kokkos::deep_copy(a.h_base, a.d_base); + valueA = a.h_view(0); + if constexpr (std::is_same_v) { + valueB = b; + KokkosBlas::axpby(a.d_view, x.d_view, b, y.d_view); + } else if constexpr (isRank0()) { + if constexpr (std::is_same_v) { + valueB = inputValueB; + } else { + typename tB::HostMirror h_b("h_B"); + Kokkos::deep_copy(h_b, b); + valueB = h_b(); + } + KokkosBlas::axpby(a.d_view, x.d_view, b, y.d_view); + } else { + Kokkos::deep_copy(b.h_base, b.d_base); + valueB = b.h_view(0); + KokkosBlas::axpby(a.d_view, x.d_view, b.d_view, y.d_view); + } + } + + 
Kokkos::deep_copy(y.h_base, y.d_base); + + if (testWithNanY == false) { + for (int i(0); i < N; ++i) { + EXPECT_NEAR_KK(static_cast(valueA * x.h_view(i) + + valueB * org_y.h_view(i)), + y.h_view(i), 4. * max_error); + } + } else { + // ******************************************************** + // Tests with 'Y == nan()' are called only for cases where + // b == Kokkos::ArithTraits::zero() + // ******************************************************** + for (int i(0); i < N; ++i) { +#if 0 + ScalarTypeY tmp = static_cast(valueA * x.h_view(i) + valueB * org_y.h_view(i)); + std::cout << "i = " << i + << ", valueA = " << valueA + << ", x.h_view(i) = " << x.h_view(i) + << ", valueB = " << valueB + << ", org_y.h_view(i) = " << org_y.h_view(i) + << ", tmp = " << tmp + << ", y.h_view(i) = " << y.h_view(i) + << std::endl; +#endif + if constexpr (std::is_same_v) { + // **************************************************************** + // 'nan()' converts to '-1' in case of 'int' => no need to compare + // **************************************************************** + if (y.h_view(i) != -1) { + EXPECT_NE(y.h_view(i), Kokkos::ArithTraits::nan()); + } + } else { + EXPECT_NE(y.h_view(i), Kokkos::ArithTraits::nan()); + } + EXPECT_NEAR_KK(static_cast(valueA * x.h_view(i)), + y.h_view(i), 4. 
* max_error); + } + } +} + +template +void impl_test_axpby_mv_unification_compare( + tA const& a, tX const& x, tB const& b, tY const& y, int N, int K, + bool testWithNanY, + typename Kokkos::ArithTraits::mag_type const max_val, + typename Kokkos::ArithTraits::mag_type const max_error, + tScalarA const inputValueA = Kokkos::ArithTraits::zero(), + tScalarB const inputValueB = Kokkos::ArithTraits::zero()) { + using ScalarTypeX = + typename std::remove_const::type; + using ScalarTypeY = + typename std::remove_const::type; + + Kokkos::Random_XorShift64_Pool rand_pool( + 13718); + + { + ScalarTypeX randStart, randEnd; + Test::getRandomBounds(max_val, randStart, randEnd); + Kokkos::fill_random(x.d_view, rand_pool, randStart, randEnd); + } + Kokkos::deep_copy(x.h_base, x.d_base); + + { + ScalarTypeY randStart, randEnd; + Test::getRandomBounds(max_val, randStart, randEnd); + if (testWithNanY) { + Kokkos::deep_copy(y.d_view, Kokkos::ArithTraits::nan()); + } else { + Kokkos::fill_random(y.d_view, rand_pool, randStart, randEnd); + } + } + tY org_y("Org_Y", N, K); + Kokkos::deep_copy(org_y.h_base, y.d_base); + + // Cannot use "if constexpr (isRank1()) {" because rank-1 variables + // are passed to current routine with view_stride_adapter<...> + bool constexpr aIsRank1 = !std::is_same_v && !isRank0(); + if constexpr (aIsRank1) { + Kokkos::deep_copy(a.h_base, a.d_base); + } + + // Cannot use "if constexpr (isRank1()) {" because rank-1 variables + // are passed to current routine with view_stride_adapter<...> + bool constexpr bIsRank1 = !std::is_same_v && !isRank0(); + if constexpr (bIsRank1) { + Kokkos::deep_copy(b.h_base, b.d_base); + } + + tScalarA valueA(Kokkos::ArithTraits::zero()); + tScalarB valueB(Kokkos::ArithTraits::zero()); + if constexpr (std::is_same_v) { + valueA = a; + if constexpr (std::is_same_v) { + valueB = b; + KokkosBlas::axpby(a, x.d_view, b, y.d_view); + } else if constexpr (isRank0()) { + if constexpr (std::is_same_v) { + valueB = inputValueB; + } else { + 
typename tB::HostMirror h_b("h_B"); + Kokkos::deep_copy(h_b, b); + valueB = h_b(); + } + KokkosBlas::axpby(a, x.d_view, b, y.d_view); + } else { + valueB = b.h_view(0); + KokkosBlas::axpby(a, x.d_view, b.d_view, y.d_view); + } + } else if constexpr (isRank0()) { + if constexpr (std::is_same_v) { + valueA = inputValueA; + } else { + typename tA::HostMirror h_a("h_A"); + Kokkos::deep_copy(h_a, a); + valueA = h_a(); + } + if constexpr (std::is_same_v) { + valueB = b; + KokkosBlas::axpby(a, x.d_view, b, y.d_view); + } else if constexpr (isRank0()) { + if constexpr (std::is_same_v) { + valueB = inputValueB; + } else { + typename tB::HostMirror h_b("h_B"); + Kokkos::deep_copy(h_b, b); + valueB = h_b(); + } + KokkosBlas::axpby(a, x.d_view, b, y.d_view); + } else { + valueB = b.h_view(0); + KokkosBlas::axpby(a, x.d_view, b.d_view, y.d_view); + } + } else { + valueA = a.h_view(0); + if constexpr (std::is_same_v) { + valueB = b; + KokkosBlas::axpby(a.d_view, x.d_view, b, y.d_view); + } else if constexpr (isRank0()) { + if constexpr (std::is_same_v) { + valueB = inputValueB; + } else { + typename tB::HostMirror h_b("h_B"); + Kokkos::deep_copy(h_b, b); + valueB = h_b(); + } + KokkosBlas::axpby(a.d_view, x.d_view, b, y.d_view); + } else { + valueB = b.h_view(0); + KokkosBlas::axpby(a.d_view, x.d_view, b.d_view, y.d_view); + } + } + + Kokkos::deep_copy(y.h_base, y.d_base); + + if (testWithNanY == false) { + for (int i(0); i < N; ++i) { + for (int k(0); k < K; ++k) { + ScalarTypeY vanillaValue(Kokkos::ArithTraits::zero()); + if constexpr (aIsRank1) { + (void)valueA; // Avoid "set but not used" error + if constexpr (bIsRank1) { + (void)valueB; // Avoid "set but not used" error + int a_k(a.h_view.extent(0) == 1 ? 0 : k); + int b_k(b.h_view.extent(0) == 1 ? 
0 : k); +#if 0 + std::cout << "In impl_test_axpby_mv_unification_compare()" + << ": i = " << i + << ", k = " << k + << ", a.h_view.extent(0) = " << a.h_view.extent(0) + << ", a_k = " << a_k + << ", b.h_view.extent(0) = " << b.h_view.extent(0) + << ", b_k = " << b_k + << ", a.h_view(a_k) = " << a.h_view(a_k) + << ", x.h_view(i, k) = " << x.h_view(i, k) + << ", b.h_view(b_k) = " << b.h_view(b_k) + << ", org_y.h_view(i, k) = " << org_y.h_view(i, k) + << std::endl; +#endif + vanillaValue = + static_cast(a.h_view(a_k) * x.h_view(i, k) + + b.h_view(b_k) * org_y.h_view(i, k)); + } else { + int a_k(a.h_view.extent(0) == 1 ? 0 : k); + vanillaValue = static_cast( + a.h_view(a_k) * x.h_view(i, k) + valueB * org_y.h_view(i, k)); + } + } else { + if constexpr (bIsRank1) { + (void)valueB; // Avoid "set but not used" error + int b_k(b.h_view.extent(0) == 1 ? 0 : k); + vanillaValue = static_cast( + valueA * x.h_view(i, k) + b.h_view(b_k) * org_y.h_view(i, k)); + } else { + vanillaValue = static_cast( + valueA * x.h_view(i, k) + valueB * org_y.h_view(i, k)); + } + } +#if 0 + std::cout << "In impl_test_axpby_mv_unification_compare(1)" + << ": i = " << i + << ", k = " << k + << ", y.h_view(i, k) = " << y.h_view(i, k) + << ", vanillaValue = " << vanillaValue + << std::endl; +#endif + EXPECT_NEAR_KK(vanillaValue, y.h_view(i, k), 4. * max_error); + } + } + } else { + // ******************************************************** + // Tests with 'Y == nan()' are called only for cases where + // b == Kokkos::ArithTraits::zero() + // ******************************************************** + for (int i(0); i < N; ++i) { + for (int k(0); k < K; ++k) { + ScalarTypeY vanillaValue(Kokkos::ArithTraits::zero()); + if constexpr (aIsRank1) { + (void)valueA; // Avoid "set but not used" error + int a_k(a.h_view.extent(0) == 1 ? 
0 : k); + vanillaValue = + static_cast(a.h_view(a_k) * x.h_view(i, k)); +#if 0 + ScalarTypeY tmp = static_cast(a.h_view(a_k) * x.h_view(i, k) + valueB * org_y.h_view(i, k)); + std::cout << "i = " << i + << ", k = " << k + << ", a_k = " << a_k + << ", a.h_view(a_k) = " << a.h_view(a_k) + << ", x.h_view(i, k) = " << x.h_view(i, k) + << ", valueB = " << valueB + << ", org_y.h_view(i, k) = " << org_y.h_view(i, k) + << ", tmp = " << tmp + << ", vanillaValue = " << vanillaValue + << ", y.h_view(i, k) = " << y.h_view(i, k) + << std::endl; +#endif + } else { + vanillaValue = static_cast(valueA * x.h_view(i, k)); +#if 0 + ScalarTypeY tmp = static_cast(valueA * x.h_view(i, k) + valueB * org_y.h_view(i, k)); + std::cout << "i = " << i + << ", k = " << k + << ", valueA = " << valueA + << ", x.h_view(i, k) = " << x.h_view(i, k) + << ", valueB = " << valueB + << ", org_y.h_view(i, k) = " << org_y.h_view(i, k) + << ", tmp = " << tmp + << ", vanillaValue = " << vanillaValue + << ", y.h_view(i, k) = " << y.h_view(i, k) + << std::endl; +#endif + } + + if constexpr (std::is_same_v) { + // **************************************************************** + // 'nan()' converts to '-1' in case of 'int' => no need to compare + // **************************************************************** + if (y.h_view(i, k) != -1) { + EXPECT_NE(y.h_view(i, k), Kokkos::ArithTraits::nan()); + } + } else { + EXPECT_NE(y.h_view(i, k), Kokkos::ArithTraits::nan()); + } +#if 0 + std::cout << "In impl_test_axpby_mv_unification_compare(2)" + << ": i = " << i + << ", k = " << k + << ", y.h_view(i, k) = " << y.h_view(i, k) + << ", vanillaValue = " << vanillaValue + << std::endl; +#endif + EXPECT_NEAR_KK(vanillaValue, y.h_view(i, k), 4. 
* max_error); + } + } + } +} + +template +void impl_test_axpby_unification(int const N) { + using ViewTypeAr0 = Kokkos::View; + using ViewTypeAr1s_1 = Kokkos::View; + using ViewTypeAr1d = Kokkos::View; + + using ViewTypeX = Kokkos::View; + + using ViewTypeBr0 = Kokkos::View; + using ViewTypeBr1s_1 = Kokkos::View; + using ViewTypeBr1d = Kokkos::View; + + using ViewTypeY = Kokkos::View; + + std::array const valuesA{ + -1, Kokkos::ArithTraits::zero(), 1, 3}; + std::array const valuesB{ + -1, Kokkos::ArithTraits::zero(), 1, 5}; + + // eps should probably be based on tScalarB since that is the type + // in which the result is computed. + using MagnitudeB = typename Kokkos::ArithTraits::mag_type; + MagnitudeB const eps = Kokkos::ArithTraits::epsilon(); + MagnitudeB const max_val = 10; + MagnitudeB const max_error = + static_cast( + Kokkos::ArithTraits::abs(valuesA[valuesA.size() - 1]) + + Kokkos::ArithTraits::abs(valuesB[valuesB.size() - 1])) * + max_val * eps; + + // ************************************************************ + // Case 01/16: Ascalar + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 01/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N); + tScalarB b; + view_stride_adapter y("Y", N); + + a = valueA; + b = valueB; + impl_test_axpby_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + tScalarB, view_stride_adapter, Device>( + a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + tScalarB, view_stride_adapter, Device>( + a, x, b, y, N, true, max_val, max_error); + } + } + } + } + + // 
************************************************************ + // Case 02/16: Ascalar + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 02/16" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + // ViewTypeBr0 b; + // Kokkos::deep_copy(b, valueB); + // //std::cout << "b() = " << b() << std::endl; + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N); + + a = valueA; + Kokkos::deep_copy(b, valueB); + impl_test_axpby_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + ViewTypeBr0, view_stride_adapter, Device>( + a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + ViewTypeBr0, view_stride_adapter, Device>( + a, x, b, y, N, true, max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 03/16: Ascalar + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 03/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N); + + a = valueA; + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, false, max_val, max_error); + if 
(valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 04/16: Ascalar + Br1d + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 04/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N); + + a = valueA; + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, true, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 05/16: Ar0 + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 05/16" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N); + tScalarB b; + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a, valueA); + b = valueB; + impl_test_axpby_unification_compare< + 
tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + tScalarB, view_stride_adapter, Device>( + a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + tScalarB, view_stride_adapter, Device>( + a, x, b, y, N, true, max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 06/16: Ar0 + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 06/16" << std::endl; +#endif + if constexpr ((std::is_same_v) || + (std::is_same_v)) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a, valueA); + Kokkos::deep_copy(b, valueB); + impl_test_axpby_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + ViewTypeBr0, view_stride_adapter, Device>( + a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + ViewTypeBr0, view_stride_adapter, Device>( + a, x, b, y, N, true, max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 07/16: Ar0 + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 07/16" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < 
valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 08/16: Ar0 + Br1d + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 08/16" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 09/16: Ar1s_1 + Bscalar + // 
************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 09/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N); + tScalarB b; + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a.d_base, valueA); + b = valueB; + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 10/16: Ar1s_1 + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 10/16" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b, valueB); + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, 
ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 11/16: Ar1s_1 + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 11/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 12/16: Ar1s_1 + Br1d + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 12/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, 
view_stride_adapter, + Device>(a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, true, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 13/16: Ar1d + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 13/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N); + tScalarB b; + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a.d_base, valueA); + b = valueB; + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 14/16: Ar1d + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 14/16" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N); + ViewTypeBr0 b("B"); + 
view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b, valueB); + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 15/16: Ar1d + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 15/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 16/16: Ar1d + Br1d + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 16/16" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const 
valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, true, max_val, max_error); + } + } + } + } +} + +template +void impl_test_axpby_mv_unification(int const N, int const K) { + // std::cout << "=========================================" << std::endl; + // std::cout << "Entering impl_test_axpby_mv_unification()" + // << ": tLayoutA = " << typeid(tLayoutA).name() + // << ": tLayoutX = " << typeid(tLayoutX).name() + // << ", tLayoutB = " << typeid(tLayoutB).name() + // << ": tLayoutY = " << typeid(tLayoutY).name() + // << std::endl; + using ViewTypeAr0 = Kokkos::View; + using ViewTypeAr1s_1 = Kokkos::View; + using ViewTypeAr1s_k = Kokkos::View; // Yes, hard coded + using ViewTypeAr1d = Kokkos::View; + + using ViewTypeX = Kokkos::View; + + using ViewTypeBr0 = Kokkos::View; + using ViewTypeBr1s_1 = Kokkos::View; + using ViewTypeBr1s_k = Kokkos::View; // Yes, hard coded + using ViewTypeBr1d = Kokkos::View; + + using ViewTypeY = Kokkos::View; + + std::array const valuesA{ + -1, Kokkos::ArithTraits::zero(), 1, 3}; + std::array const valuesB{ + -1, Kokkos::ArithTraits::zero(), 1, 5}; + + // eps should probably be based on tScalarB since that is the type + // in which the result is computed. 
+ using MagnitudeB = typename Kokkos::ArithTraits::mag_type; + MagnitudeB const eps = Kokkos::ArithTraits::epsilon(); + MagnitudeB const max_val = 10; + MagnitudeB const max_error = + static_cast( + Kokkos::ArithTraits::abs(valuesA[valuesA.size() - 1]) + + Kokkos::ArithTraits::abs(valuesB[valuesB.size() - 1])) * + max_val * eps; + + // ************************************************************ + // Case 01/36: Ascalar + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 01/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N, K); + tScalarB b; + view_stride_adapter y("Y", N, K); + + a = valueA; + b = valueB; + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + tScalarB, view_stride_adapter, Device>( + a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + tScalarB, view_stride_adapter, Device>( + a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 02/36: Ascalar + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 02/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N, K); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N, K); + + a = valueA; + Kokkos::deep_copy(b, 
valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + ViewTypeBr0, view_stride_adapter, Device>( + a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + ViewTypeBr0, view_stride_adapter, Device>( + a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 03/36: Ascalar + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 03/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + a = valueA; + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 04/36: Ascalar + Br1s_k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 04/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + tScalarA a; + 
view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + a = valueA; + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + // Case 05/36: Ascalar + Br1d,1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 05/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + a = valueA; + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 06/36: Ascalar + Br1d,k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 06/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t 
j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + tScalarA a; + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + a = valueA; + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, tScalarA, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + // Case 07/36: Ar0 + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 07/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N, K); + tScalarB b; + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a, valueA); + b = valueB; + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + tScalarB, view_stride_adapter, Device>( + a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + tScalarB, view_stride_adapter, Device>( + a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 08/36: Ar0 + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout
<< "Starting case 08/36" << std::endl; +#endif + if constexpr ((std::is_same_v) || + (std::is_same_v)) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N, K); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a, valueA); + Kokkos::deep_copy(b, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + ViewTypeBr0, view_stride_adapter, Device>( + a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + ViewTypeBr0, view_stride_adapter, Device>( + a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 09/36: Ar0 + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 09/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, 
view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 10/36: Ar0 + Br1s_k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 10/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a, valueA); + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 11/36: Ar0 + Br1d,1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 11/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + 
Kokkos::deep_copy(a, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 12/36: Ar0 + Br1d,k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 12/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + ViewTypeAr0 a("A"); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a, valueA); + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, ViewTypeAr0, view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 13/36: Ar1s_1 + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 13/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { +
tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + tScalarB b; + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + b = valueB; + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 14/36: Ar1s_1 + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 14/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + } + + // 
************************************************************ + // Case 15/36: Ar1s_1 + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 15/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 16/36: Ar1s_1 + Br1s_k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 16/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + 
Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + // Case 17/36: Ar1s_1 + Br1d,1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 17/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 18/36: Ar1s_1 + Br1d,k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 18/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); 
+ if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + // Case 19/36: Ar1s_k + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 19/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + tScalarB b; + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + b = valueB; + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 20/36: Ar1s_k + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << 
"Starting case 20/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + Kokkos::deep_copy(b, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 21/36: Ar1s_k + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 21/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + 
a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 22/36: Ar1s_k + Br1s_k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 22/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + 
// Case 23/36: Ar1s_k + Br1d,1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 23/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 24/36: Ar1s_k + Br1d,k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 24/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + 
Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + // Case 25/36: Ar1d,1 + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 25/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + tScalarB b; + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + b = valueB; + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 26/36: Ar1d,1 + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 26/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + //
Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 27/36: Ar1d,1 + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 27/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, 
max_error); + } + } + } + } + + // ************************************************************ + // Case 28/36: Ar1d,1 + Br1s_k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 28/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + // Case 29/36: Ar1d,1 + Br1d,1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 29/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); 
+ if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 30/36: Ar1d,1 + Br1d,k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 30/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", 1); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + Kokkos::deep_copy(a.d_base, valueA); + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + // Case 31/36: Ar1d,k + Bscalar + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 31/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + tScalarB b; + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA
+ k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + b = valueB; + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, tScalarB, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 32/36: Ar1d,k + Br0 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 32/36" << std::endl; +#endif + if constexpr (std::is_same_v) { + // Avoid the test, due to compilation errors + } else { + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + ViewTypeBr0 b("B"); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + Kokkos::deep_copy(b, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, K, false, + max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, ViewTypeBr0, + view_stride_adapter, Device>(a, x, b, y, N, K, true, 
+ max_val, max_error); + } + } + } + } + } + + // ************************************************************ + // Case 33/36: Ar1d,k + Br1s_1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 33/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, + view_stride_adapter, Device>(a, x, b, y, N, K, true, + max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 34/36: Ar1d,k + Br1s_k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 34/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + if (K == numVecsAxpbyTest) { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + if 
constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // ************************************************************ + // Case 35/36: Ar1d,k + Br1d,1 + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 35/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", 1); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + Kokkos::deep_copy(b.d_base, valueB); + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + if (valueB == Kokkos::ArithTraits::zero()) { + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, 
view_stride_adapter, + Device>(a, x, b, y, N, K, true, max_val, max_error); + } + } + } + } + + // ************************************************************ + // Case 36/36: Ar1d,k + Br1d,k + // ************************************************************ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Starting case 36/36" << std::endl; +#endif + for (size_t i(0); i < valuesA.size(); ++i) { + tScalarA const valueA(valuesA[i]); + for (size_t j(0); j < valuesB.size(); ++j) { + tScalarB const valueB(valuesB[j]); + { + view_stride_adapter a("A", K); + view_stride_adapter x("X", N, K); + view_stride_adapter b("B", K); + view_stride_adapter y("Y", N, K); + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + a.h_view[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } else { + for (int k(0); k < K; ++k) { + a.h_base[k] = valueA + k; + } + Kokkos::deep_copy(a.d_base, a.h_base); + } + + if constexpr (std::is_same_v) { + for (int k(0); k < K; ++k) { + b.h_view[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } else { + for (int k(0); k < K; ++k) { + b.h_base[k] = valueB + k; + } + Kokkos::deep_copy(b.d_base, b.h_base); + } + + impl_test_axpby_mv_unification_compare< + tScalarA, view_stride_adapter, + view_stride_adapter, tScalarB, + view_stride_adapter, view_stride_adapter, + Device>(a, x, b, y, N, K, false, max_val, max_error); + } + } + } + + // std::cout << "Leaving impl_test_axpby_mv_unification()" << std::endl; + // std::cout << "=========================================" << std::endl; +} + +} // namespace Test + +template +int test_axpby_unification() { +#if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Calling impl_test_axpby_unif(), L-LLL" << std::endl; +#endif + Test::impl_test_axpby_unification< + tScalarA, Kokkos::LayoutLeft, tScalarX, Kokkos::LayoutLeft, tScalarB, + 
Kokkos::LayoutLeft, tScalarY, Kokkos::LayoutLeft, Device>(14); +#endif + +#if defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Calling impl_test_axpby_unif(), L-RRR" << std::endl; +#endif + Test::impl_test_axpby_unification< + tScalarA, Kokkos::LayoutRight, tScalarX, Kokkos::LayoutRight, tScalarB, + Kokkos::LayoutRight, tScalarY, Kokkos::LayoutRight, Device>(14); +#endif + +#if (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Calling impl_test_axpby_unif(), L-SSS" << std::endl; +#endif + Test::impl_test_axpby_unification< + tScalarA, Kokkos::LayoutStride, tScalarX, Kokkos::LayoutStride, tScalarB, + Kokkos::LayoutStride, tScalarY, Kokkos::LayoutStride, Device>(14); +#endif + +#if !defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS) +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Calling impl_test_axpby_unif(), L-SLL" << std::endl; +#endif + Test::impl_test_axpby_unification< + tScalarA, Kokkos::LayoutStride, tScalarX, Kokkos::LayoutStride, tScalarB, + Kokkos::LayoutLeft, tScalarY, Kokkos::LayoutLeft, Device>(14); + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Calling impl_test_axpby_unif(), L-LSS" << std::endl; +#endif + Test::impl_test_axpby_unification< + tScalarA, Kokkos::LayoutLeft, tScalarX, Kokkos::LayoutLeft, tScalarB, + Kokkos::LayoutStride, tScalarY, Kokkos::LayoutStride, Device>(14); + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Calling impl_test_axpby_unif(), L-SRS" << std::endl; +#endif + Test::impl_test_axpby_unification< + tScalarA, Kokkos::LayoutLeft, tScalarX, Kokkos::LayoutStride, tScalarB, + Kokkos::LayoutRight, tScalarY, Kokkos::LayoutStride, Device>(14); + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Calling impl_test_axpby_unif(), L-LSR" << std::endl; +#endif + 
Test::impl_test_axpby_unification< + tScalarA, Kokkos::LayoutStride, tScalarX, Kokkos::LayoutLeft, tScalarB, + Kokkos::LayoutStride, tScalarY, Kokkos::LayoutRight, Device>(14); +#endif + return 1; +} + +template +int test_axpby_mv_unification() { +#if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) + Test::impl_test_axpby_mv_unification< + tScalarA, Kokkos::LayoutLeft, tScalarX, Kokkos::LayoutLeft, tScalarB, + Kokkos::LayoutLeft, tScalarY, Kokkos::LayoutLeft, Device>( + 14, numVecsAxpbyTest); +#endif + +#if defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) + Test::impl_test_axpby_mv_unification< + tScalarA, Kokkos::LayoutRight, tScalarX, Kokkos::LayoutRight, tScalarB, + Kokkos::LayoutRight, tScalarY, Kokkos::LayoutRight, Device>( + 14, numVecsAxpbyTest); +#endif + +#if (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) + Test::impl_test_axpby_mv_unification< + tScalarA, Kokkos::LayoutStride, tScalarX, Kokkos::LayoutStride, tScalarB, + Kokkos::LayoutStride, tScalarY, Kokkos::LayoutStride, Device>( + 14, numVecsAxpbyTest); +#endif + +#if !defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS) + Test::impl_test_axpby_mv_unification< + tScalarA, Kokkos::LayoutStride, tScalarX, Kokkos::LayoutStride, tScalarB, + Kokkos::LayoutLeft, tScalarY, Kokkos::LayoutLeft, Device>( + 14, numVecsAxpbyTest); + Test::impl_test_axpby_mv_unification< + tScalarA, Kokkos::LayoutLeft, tScalarX, Kokkos::LayoutLeft, tScalarB, + Kokkos::LayoutStride, tScalarY, Kokkos::LayoutStride, Device>( + 14, numVecsAxpbyTest); + + Test::impl_test_axpby_mv_unification< + tScalarA, Kokkos::LayoutLeft, tScalarX, Kokkos::LayoutStride, tScalarB, + Kokkos::LayoutRight, tScalarY, Kokkos::LayoutStride, Device>( + 14, numVecsAxpbyTest); + + Test::impl_test_axpby_mv_unification< + 
tScalarA, Kokkos::LayoutStride, tScalarX, Kokkos::LayoutLeft, tScalarB, + Kokkos::LayoutStride, tScalarY, Kokkos::LayoutRight, Device>( + 14, numVecsAxpbyTest); +#endif + return 1; +} + +#if defined(KOKKOSKERNELS_INST_FLOAT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, axpby_unification_float) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::axpby_unification_float"); + test_axpby_unification(); + Kokkos::Profiling::popRegion(); +} +TEST_F(TestCategory, axpby_mv_unification_float) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::axpby_mv_unification_float"); + test_axpby_mv_unification(); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_DOUBLE) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, axpby_unification_double) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::axpby_unification_double"); + test_axpby_unification(); +} +TEST_F(TestCategory, axpby_mv_unification_double) { + Kokkos::Profiling::pushRegion( + "KokkosBlas::Test::axpby_mv_unification_double"); + test_axpby_mv_unification(); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_COMPLEX_DOUBLE) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, axpby_unification_complex_double) { + Kokkos::Profiling::pushRegion( + "KokkosBlas::Test::axpby_unification_complex_double"); + test_axpby_unification, Kokkos::complex, + Kokkos::complex, Kokkos::complex, + TestDevice>(); + Kokkos::Profiling::popRegion(); +} +TEST_F(TestCategory, axpby_mv_unification_complex_double) { + Kokkos::Profiling::pushRegion( + "KokkosBlas::Test::axpby_mv_unification_complex_double"); + test_axpby_mv_unification, Kokkos::complex, + Kokkos::complex, Kokkos::complex, + TestDevice>(); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_INT) || \ + 
(!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, axpby_unification_int) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::axpby_unification_int"); + test_axpby_unification(); + Kokkos::Profiling::popRegion(); +} +TEST_F(TestCategory, axpby_mv_unification_int) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::axpby_mv_unification_int"); + test_axpby_mv_unification(); + Kokkos::Profiling::popRegion(); +} +#endif + +#if !defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS) +TEST_F(TestCategory, axpby_unification_double_int) { + Kokkos::Profiling::pushRegion( + "KokkosBlas::Test::axpby_unification_double_int"); + test_axpby_unification(); + Kokkos::Profiling::popRegion(); +} +TEST_F(TestCategory, axpby_double_mv_unification_int) { + Kokkos::Profiling::pushRegion( + "KokkosBlas::Test::axpby_mv_unification_double_int"); + test_axpby_mv_unification(); + Kokkos::Profiling::popRegion(); +} +#endif diff --git a/blas/unit_test/Test_Blas1_nrm1.hpp b/blas/unit_test/Test_Blas1_nrm1.hpp index f6938c5147..24795878d1 100644 --- a/blas/unit_test/Test_Blas1_nrm1.hpp +++ b/blas/unit_test/Test_Blas1_nrm1.hpp @@ -22,10 +22,10 @@ namespace Test { template void impl_test_nrm1(int N) { - typedef typename ViewTypeA::value_type ScalarA; - typedef Kokkos::ArithTraits AT; - typedef typename AT::mag_type mag_type; - typedef Kokkos::ArithTraits MAT; + using ScalarA = typename ViewTypeA::value_type; + using AT = Kokkos::ArithTraits; + using mag_type = typename AT::mag_type; + using MAT = Kokkos::ArithTraits; view_stride_adapter a("a", N); diff --git a/blas/unit_test/Test_Blas1_swap.hpp b/blas/unit_test/Test_Blas1_swap.hpp index 382c35947b..624552f1dc 100644 --- a/blas/unit_test/Test_Blas1_swap.hpp +++ b/blas/unit_test/Test_Blas1_swap.hpp @@ -3,11 +3,12 @@ namespace Test { namespace Impl { -template +template void test_swap(int const vector_length) { - using vector_type = VectorType; - using 
execution_space = typename vector_type::execution_space; - using scalar_type = typename VectorType::non_const_value_type; + using execution_space = typename DeviceType::execution_space; + using memory_space = typename DeviceType::memory_space; + using vector_type = Kokkos::View; + using scalar_type = typename vector_type::non_const_value_type; using mag_type = typename Kokkos::ArithTraits::mag_type; // Note that Xref and Yref need to always be copies of X and Y @@ -43,14 +44,12 @@ void test_swap(int const vector_length) { } // namespace Impl } // namespace Test -template +template int test_swap() { - using Vector = Kokkos::View; - - Test::Impl::test_swap(0); - Test::Impl::test_swap(10); - Test::Impl::test_swap(256); - Test::Impl::test_swap(1024); + Test::Impl::test_swap(0); + Test::Impl::test_swap(10); + Test::Impl::test_swap(256); + Test::Impl::test_swap(1024); return 0; } diff --git a/blas/unit_test/Test_Blas2_ger.hpp b/blas/unit_test/Test_Blas2_ger.hpp index a0860bae04..df3d2cb5d1 100644 --- a/blas/unit_test/Test_Blas2_ger.hpp +++ b/blas/unit_test/Test_Blas2_ger.hpp @@ -79,10 +79,11 @@ class GerTester { using _KAT_A = Kokkos::ArithTraits; using _AuxType = typename _KAT_A::mag_type; - void populateVariables(ScalarA& alpha, _HostViewTypeX& h_x, - _HostViewTypeY& h_y, _HostViewTypeA& h_A, - _ViewTypeExpected& h_expected, _ViewTypeX& x, - _ViewTypeY& y, _ViewTypeA& A, + void populateVariables(ScalarA& alpha, + view_stride_adapter<_ViewTypeX, false>& x, + view_stride_adapter<_ViewTypeY, false>& y, + view_stride_adapter<_ViewTypeA, false>& A, + _ViewTypeExpected& h_expected, bool& expectedResultIsKnown); template @@ -149,11 +150,10 @@ class GerTester { T shrinkAngleToZeroTwoPiRange(const T input); template - void callKkGerAndCompareAgainstExpected(const ScalarA& alpha, TX& x, TY& y, - _ViewTypeA& A, - const _HostViewTypeA& h_A, - const _ViewTypeExpected& h_expected, - const std::string& situation); + void callKkGerAndCompareAgainstExpected( + const ScalarA& alpha, TX& 
x, TY& y, + view_stride_adapter<_ViewTypeA, false>& A, + const _ViewTypeExpected& h_expected, const std::string& situation); const bool _A_is_complex; const bool _A_is_lr; @@ -195,8 +195,12 @@ GerTester::value ? 1.0e-6 : 1.0e-9), - _relTol(std::is_same<_AuxType, float>::value ? 5.0e-3 : 1.0e-6), + _absTol(std::is_same<_AuxType, float>::value + ? 1.0e-6 + : (std::is_same<_AuxType, double>::value ? 1.0e-9 : 0)), + _relTol(std::is_same<_AuxType, float>::value + ? 5.0e-3 + : (std::is_same<_AuxType, double>::value ? 1.0e-6 : 0)), _M(-1), _N(-1), _useAnalyticalResults(false), @@ -282,8 +286,7 @@ void GerTesterpopulateVariables(alpha, x.h_view, y.h_view, A.h_view, - h_expected.d_view, x.d_view, y.d_view, A.d_view, + this->populateVariables(alpha, x, y, A, h_expected.d_view, expectedResultIsKnown); // ******************************************************************** @@ -329,8 +332,7 @@ void GerTestercallKkGerAndCompareAgainstExpected( - alpha, x.d_view, y.d_view, A.d_view, A.h_view, h_expected.d_view, - "non const {x,y}"); + alpha, x.d_view, y.d_view, A, h_expected.d_view, "non const {x,y}"); } // ******************************************************************** @@ -339,8 +341,7 @@ void GerTestercallKkGerAndCompareAgainstExpected(alpha, x.d_view_const, y.d_view, - A.d_view, A.h_view, + this->callKkGerAndCompareAgainstExpected(alpha, x.d_view_const, y.d_view, A, h_expected.d_view, "const x"); } @@ -350,8 +351,7 @@ void GerTestercallKkGerAndCompareAgainstExpected(alpha, x.d_view, y.d_view_const, - A.d_view, A.h_view, + this->callKkGerAndCompareAgainstExpected(alpha, x.d_view, y.d_view_const, A, h_expected.d_view, "const y"); } @@ -362,7 +362,7 @@ void GerTestercallKkGerAndCompareAgainstExpected(alpha, x.d_view_const, - y.d_view_const, A.d_view, A.h_view, + y.d_view_const, A, h_expected.d_view, "const {x,y}"); } @@ -384,52 +384,53 @@ void GerTester -void GerTester::populateVariables(ScalarA& alpha, _HostViewTypeX& h_x, - _HostViewTypeY& h_y, - _HostViewTypeA& h_A, - 
_ViewTypeExpected& h_expected, - _ViewTypeX& x, _ViewTypeY& y, - _ViewTypeA& A, - bool& expectedResultIsKnown) { +void GerTester< + ScalarX, tLayoutX, ScalarY, tLayoutY, ScalarA, tLayoutA, + Device>::populateVariables(ScalarA& alpha, + view_stride_adapter<_ViewTypeX, false>& x, + view_stride_adapter<_ViewTypeY, false>& y, + view_stride_adapter<_ViewTypeA, false>& A, + _ViewTypeExpected& h_expected, + bool& expectedResultIsKnown) { expectedResultIsKnown = false; if (_useAnalyticalResults) { - this->populateAnalyticalValues(alpha, h_x, h_y, h_A, h_expected); - Kokkos::deep_copy(x, h_x); - Kokkos::deep_copy(y, h_y); - Kokkos::deep_copy(A, h_A); + this->populateAnalyticalValues(alpha, x.h_view, y.h_view, A.h_view, + h_expected); + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(y.d_base, y.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); expectedResultIsKnown = true; } else if ((_M == 1) && (_N == 1)) { alpha = 3; - h_x[0] = 2; + x.h_view[0] = 2; - h_y[0] = 3; + y.h_view[0] = 3; - h_A(0, 0) = 7; + A.h_view(0, 0) = 7; - Kokkos::deep_copy(x, h_x); - Kokkos::deep_copy(y, h_y); - Kokkos::deep_copy(A, h_A); + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(y.d_base, y.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); h_expected(0, 0) = 25; expectedResultIsKnown = true; } else if ((_M == 1) && (_N == 2)) { alpha = 3; - h_x[0] = 2; + x.h_view[0] = 2; - h_y[0] = 3; - h_y[1] = 4; + y.h_view[0] = 3; + y.h_view[1] = 4; - h_A(0, 0) = 7; - h_A(0, 1) = -6; + A.h_view(0, 0) = 7; + A.h_view(0, 1) = -6; - Kokkos::deep_copy(x, h_x); - Kokkos::deep_copy(y, h_y); - Kokkos::deep_copy(A, h_A); + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(y.d_base, y.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); h_expected(0, 0) = 25; h_expected(0, 1) = 18; @@ -437,20 +438,20 @@ void GerTester void GerTester:: - callKkGerAndCompareAgainstExpected(const ScalarA& alpha, TX& x, TY& y, - _ViewTypeA& A, const _HostViewTypeA& h_A, - const _ViewTypeExpected& h_expected, - 
const std::string& situation) { + callKkGerAndCompareAgainstExpected( + const ScalarA& alpha, TX& x, TY& y, + view_stride_adapter<_ViewTypeA, false>& A, + const _ViewTypeExpected& h_expected, const std::string& situation) { #ifdef HAVE_KOKKOSKERNELS_DEBUG #if KOKKOS_VERSION < 40199 KOKKOS_IMPL_DO_NOT_USE_PRINTF( @@ -1379,7 +1380,7 @@ void GerTestercompareKkGerAgainstExpected(alpha, h_A, h_expected); + this->compareKkGerAgainstExpected(alpha, A.h_view, h_expected); } } diff --git a/blas/unit_test/Test_Blas2_syr.hpp b/blas/unit_test/Test_Blas2_syr.hpp index 4396c81bb2..1253a8e329 100644 --- a/blas/unit_test/Test_Blas2_syr.hpp +++ b/blas/unit_test/Test_Blas2_syr.hpp @@ -76,9 +76,10 @@ class SyrTester { using _KAT_A = Kokkos::ArithTraits; using _AuxType = typename _KAT_A::mag_type; - void populateVariables(ScalarA& alpha, _HostViewTypeX& h_x, - _HostViewTypeA& h_A, _ViewTypeExpected& h_expected, - _ViewTypeX& x, _ViewTypeA& A, + void populateVariables(ScalarA& alpha, + view_stride_adapter<_ViewTypeX, false>& x, + view_stride_adapter<_ViewTypeA, false>& A, + _ViewTypeExpected& h_expected, bool& expectedResultIsKnown); template @@ -145,11 +146,9 @@ class SyrTester { T shrinkAngleToZeroTwoPiRange(const T input); template - void callKkSyrAndCompareAgainstExpected(const ScalarA& alpha, TX& x, - _ViewTypeA& A, - const _HostViewTypeA& h_A, - const _ViewTypeExpected& h_expected, - const std::string& situation); + void callKkSyrAndCompareAgainstExpected( + const ScalarA& alpha, TX& x, view_stride_adapter<_ViewTypeA, false>& A, + const _ViewTypeExpected& h_expected, const std::string& situation); template void callKkGerAndCompareKkSyrAgainstIt( @@ -198,8 +197,12 @@ SyrTester::SyrTester() // large enough to require 'relTol' to value 5.0e-3. The same // calculations show no discrepancies for calculations with double. // **************************************************************** - _absTol(std::is_same<_AuxType, float>::value ? 
1.0e-6 : 1.0e-9), - _relTol(std::is_same<_AuxType, float>::value ? 5.0e-3 : 1.0e-6), + _absTol(std::is_same<_AuxType, float>::value + ? 1.0e-6 + : (std::is_same<_AuxType, double>::value ? 1.0e-9 : 0)), + _relTol(std::is_same<_AuxType, float>::value + ? 5.0e-3 + : (std::is_same<_AuxType, double>::value ? 1.0e-6 : 0)), _M(-1), _N(-1), _useAnalyticalResults(false), @@ -279,8 +282,8 @@ void SyrTester::test( // ******************************************************************** // Step 2 of 7: populate alpha, h_x, h_A, h_expected, x, A // ******************************************************************** - this->populateVariables(alpha, x.h_view, A.h_view, h_expected.d_view, - x.d_view, A.d_view, expectedResultIsKnown); + this->populateVariables(alpha, x, A, h_expected.d_view, + expectedResultIsKnown); // ******************************************************************** // Step 3 of 7: populate h_vanilla @@ -324,8 +327,8 @@ void SyrTester::test( Kokkos::deep_copy(org_A.h_view, A.h_view); if (test_x) { - this->callKkSyrAndCompareAgainstExpected( - alpha, x.d_view, A.d_view, A.h_view, h_expected.d_view, "non const x"); + this->callKkSyrAndCompareAgainstExpected(alpha, x.d_view, A, + h_expected.d_view, "non const x"); if ((_useAnalyticalResults == false) && // Just to save run time (_kkGerShouldThrowException == false)) { @@ -340,9 +343,8 @@ void SyrTester::test( if (test_cx) { Kokkos::deep_copy(A.d_base, org_A.d_base); - this->callKkSyrAndCompareAgainstExpected(alpha, x.d_view_const, A.d_view, - A.h_view, h_expected.d_view, - "const x"); + this->callKkSyrAndCompareAgainstExpected(alpha, x.d_view_const, A, + h_expected.d_view, "const x"); } // ******************************************************************** @@ -368,42 +370,42 @@ void SyrTester::test( template void SyrTester::populateVariables( - ScalarA& alpha, _HostViewTypeX& h_x, _HostViewTypeA& h_A, - _ViewTypeExpected& h_expected, _ViewTypeX& x, _ViewTypeA& A, + ScalarA& alpha, view_stride_adapter<_ViewTypeX, 
false>& x, + view_stride_adapter<_ViewTypeA, false>& A, _ViewTypeExpected& h_expected, bool& expectedResultIsKnown) { expectedResultIsKnown = false; if (_useAnalyticalResults) { - this->populateAnalyticalValues(alpha, h_x, h_A, h_expected); - Kokkos::deep_copy(x, h_x); - Kokkos::deep_copy(A, h_A); + this->populateAnalyticalValues(alpha, x.h_view, A.h_view, h_expected); + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); expectedResultIsKnown = true; } else if (_N == 1) { alpha = 3; - h_x[0] = 2; + x.h_view[0] = 2; - h_A(0, 0) = 7; + A.h_view(0, 0) = 7; - Kokkos::deep_copy(x, h_x); - Kokkos::deep_copy(A, h_A); + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); h_expected(0, 0) = 19; expectedResultIsKnown = true; } else if (_N == 2) { alpha = 3; - h_x[0] = -2; - h_x[1] = 9; + x.h_view[0] = -2; + x.h_view[1] = 9; - h_A(0, 0) = 17; - h_A(0, 1) = -43; - h_A(1, 0) = -43; - h_A(1, 1) = 101; + A.h_view(0, 0) = 17; + A.h_view(0, 1) = -43; + A.h_view(1, 0) = -43; + A.h_view(1, 1) = 101; - Kokkos::deep_copy(x, h_x); - Kokkos::deep_copy(A, h_A); + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); if (_useUpOption) { h_expected(0, 0) = 29; @@ -426,17 +428,17 @@ void SyrTester::populateVariables( { ScalarX randStart, randEnd; Test::getRandomBounds(1.0, randStart, randEnd); - Kokkos::fill_random(x, rand_pool, randStart, randEnd); + Kokkos::fill_random(x.d_view, rand_pool, randStart, randEnd); } { ScalarA randStart, randEnd; Test::getRandomBounds(1.0, randStart, randEnd); - Kokkos::fill_random(A, rand_pool, randStart, randEnd); + Kokkos::fill_random(A.d_view, rand_pool, randStart, randEnd); } - Kokkos::deep_copy(h_x, x); - Kokkos::deep_copy(h_A, A); + Kokkos::deep_copy(x.h_base, x.d_base); + Kokkos::deep_copy(A.h_base, A.d_base); if (_useHermitianOption && _A_is_complex) { // **************************************************************** @@ -444,12 +446,12 @@ void 
SyrTester::populateVariables( // **************************************************************** for (int i(0); i < _N; ++i) { for (int j(i + 1); j < _N; ++j) { - h_A(i, j) = _KAT_A::conj(h_A(j, i)); + A.h_view(i, j) = _KAT_A::conj(A.h_view(j, i)); } } for (int i(0); i < _N; ++i) { - h_A(i, i) = 0.5 * (h_A(i, i) + _KAT_A::conj(h_A(i, i))); + A.h_view(i, i) = 0.5 * (A.h_view(i, i) + _KAT_A::conj(A.h_view(i, i))); } } else { // **************************************************************** @@ -457,18 +459,18 @@ void SyrTester::populateVariables( // **************************************************************** for (int i(0); i < _N; ++i) { for (int j(i + 1); j < _N; ++j) { - h_A(i, j) = h_A(j, i); + A.h_view(i, j) = A.h_view(j, i); } } } - Kokkos::deep_copy(A, h_A); + Kokkos::deep_copy(A.d_base, A.h_base); } #ifdef HAVE_KOKKOSKERNELS_DEBUG if (_N <= 2) { for (int i(0); i < _M; ++i) { for (int j(0); j < _N; ++j) { - std::cout << "h_origA(" << i << "," << j << ")=" << h_A(i, j) + std::cout << "h_origA(" << i << "," << j << ")=" << A.h_view(i, j) << std::endl; } } @@ -1433,10 +1435,9 @@ template template void SyrTester:: - callKkSyrAndCompareAgainstExpected(const ScalarA& alpha, TX& x, - _ViewTypeA& A, const _HostViewTypeA& h_A, - const _ViewTypeExpected& h_expected, - const std::string& situation) { + callKkSyrAndCompareAgainstExpected( + const ScalarA& alpha, TX& x, view_stride_adapter<_ViewTypeA, false>& A, + const _ViewTypeExpected& h_expected, const std::string& situation) { #ifdef HAVE_KOKKOSKERNELS_DEBUG std::cout << "In Test_Blas2_syr, '" << situation << "', alpha = " << alpha << std::endl; @@ -1457,7 +1458,7 @@ void SyrTester:: bool gotStdException(false); bool gotUnknownException(false); try { - KokkosBlas::syr(mode.c_str(), uplo.c_str(), alpha, x, A); + KokkosBlas::syr(mode.c_str(), uplo.c_str(), alpha, x, A.d_view); } catch (const std::exception& e) { #ifdef HAVE_KOKKOSKERNELS_DEBUG std::cout << "In Test_Blas2_syr, '" << situation @@ -1482,8 +1483,8 @@ 
void SyrTester:: << "have thrown a std::exception"; if ((gotStdException == false) && (gotUnknownException == false)) { - Kokkos::deep_copy(h_A, A); - this->compareKkSyrAgainstReference(alpha, h_A, h_expected); + Kokkos::deep_copy(A.h_base, A.d_base); + this->compareKkSyrAgainstReference(alpha, A.h_view, h_expected); } } diff --git a/blas/unit_test/Test_Blas2_syr2.hpp b/blas/unit_test/Test_Blas2_syr2.hpp new file mode 100644 index 0000000000..c49eba765b --- /dev/null +++ b/blas/unit_test/Test_Blas2_syr2.hpp @@ -0,0 +1,1965 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +// ********************************************************************** +// The tests executed by the code below cover many combinations for +// the operations: +// --> A += alpha * x * y^T + alpha * y * x^T, or +// --> A += alpha * x * y^H + conj(alpha) * y * x^H +// 01) Type of 'x' components: float, double, complex, ... +// 02) Type of 'y' components: float, double, complex, ... +// 03) Type of 'A' components: float, double, complex, ... +// 04) Execution space: serial, threads, OpenMP, Cuda, ... 
+// 05) Layout of 'x' +// 06) Layout of 'y' +// 07) Layout of 'A' +// 08) Dimension of 'A' +// 09) Options 'const' or 'non const' for x view, when calling syr2() +// 10) Options 'const' or 'non const' for y view, when calling syr2() +// 11) Usage of analytical results in the tests +// 12) Options 'T' or 'H' when calling syr2() +// 13) Options 'U' or 'L' when calling syr2() +// +// Choices (01)-(05) are selected in the routines TEST_F() at the +// very bottom of the file, when calling test_syr2<...>(). +// +// Choices (06)-(13) are selected in routine test_syr2<...>(), +// when calling the method test() of class Test::Syr2Tester<...>. +// +// The class Test::Syr2Tester<...> represents the "core" of the test +// logic, where all calculations, comparisons, and success/failure +// decisions are performed. +// +// A high level explanation of method Test::Syr2Tester<...>::test() +// is given by the 7 steps named "Step 1 of 7" to "Step 7 of 7" +// in the code below. +// ********************************************************************** + +#include +#include +#include +#include +#include + +namespace Test { + +template +class Syr2Tester { + public: + Syr2Tester(); + + ~Syr2Tester(); + + void test(const int N, const int nonConstConstCombinations, + const bool useAnalyticalResults = false, + const bool useHermitianOption = false, + const bool useUpOption = false); + + private: + using _ViewTypeX = Kokkos::View; + using _ViewTypeY = Kokkos::View; + using _ViewTypeA = Kokkos::View; + + using _HostViewTypeX = typename _ViewTypeX::HostMirror; + using _HostViewTypeY = typename _ViewTypeY::HostMirror; + using _HostViewTypeA = typename _ViewTypeA::HostMirror; + using _ViewTypeExpected = + Kokkos::View; + + using _KAT_A = Kokkos::ArithTraits; + using _AuxType = typename _KAT_A::mag_type; + + void populateVariables(ScalarA& alpha, + view_stride_adapter<_ViewTypeX, false>& x, + view_stride_adapter<_ViewTypeY, false>& y, + view_stride_adapter<_ViewTypeA, false>& A, +
_ViewTypeExpected& h_expected, + bool& expectedResultIsKnown); + + template + typename std::enable_if>::value || + std::is_same>::value, + void>::type + populateAnalyticalValues(T& alpha, _HostViewTypeX& h_x, _HostViewTypeY& h_y, + _HostViewTypeA& h_A, _ViewTypeExpected& h_expected); + + template + typename std::enable_if>::value && + !std::is_same>::value, + void>::type + populateAnalyticalValues(T& alpha, _HostViewTypeX& h_x, _HostViewTypeY& h_y, + _HostViewTypeA& h_A, _ViewTypeExpected& h_expected); + + template + typename std::enable_if>::value || + std::is_same>::value, + void>::type + populateVanillaValues(const T& alpha, const _HostViewTypeX& h_x, + const _HostViewTypeY& h_y, const _HostViewTypeA& h_A, + _ViewTypeExpected& h_vanilla); + + template + typename std::enable_if>::value && + !std::is_same>::value, + void>::type + populateVanillaValues(const T& alpha, const _HostViewTypeX& h_x, + const _HostViewTypeY& h_y, const _HostViewTypeA& h_A, + _ViewTypeExpected& h_vanilla); + + template + typename std::enable_if>::value || + std::is_same>::value, + void>::type + compareVanillaAgainstExpected(const T& alpha, + const _ViewTypeExpected& h_vanilla, + const _ViewTypeExpected& h_expected); + + template + typename std::enable_if>::value && + !std::is_same>::value, + void>::type + compareVanillaAgainstExpected(const T& alpha, + const _ViewTypeExpected& h_vanilla, + const _ViewTypeExpected& h_expected); + + template + typename std::enable_if>::value || + std::is_same>::value, + void>::type + compareKkSyr2AgainstReference(const T& alpha, const _HostViewTypeA& h_A, + const _ViewTypeExpected& h_reference); + + template + typename std::enable_if>::value && + !std::is_same>::value, + void>::type + compareKkSyr2AgainstReference(const T& alpha, const _HostViewTypeA& h_A, + const _ViewTypeExpected& h_reference); + + template + T shrinkAngleToZeroTwoPiRange(const T input); + + template + void callKkSyr2AndCompareAgainstExpected( + const ScalarA& alpha, TX& x, TY& y, + 
view_stride_adapter<_ViewTypeA, false>& A, + const _ViewTypeExpected& h_expected, const std::string& situation); + + template + void callKkGerAndCompareKkSyr2AgainstIt( + const ScalarA& alpha, TX& x, TY& y, + view_stride_adapter<_ViewTypeA, false>& org_A, + const _HostViewTypeA& h_A_syr2, const std::string& situation); + + const bool _A_is_complex; + const bool _A_is_lr; + const bool _A_is_ll; + const bool _testIsGpu; + const bool _vanillaUsesDifferentOrderOfOps; + const _AuxType _absTol; + const _AuxType _relTol; + int _M; + int _N; + bool _useAnalyticalResults; + bool _useHermitianOption; + bool _useUpOption; + bool _kkSyr2ShouldThrowException; + bool _kkGerShouldThrowException; +}; + +template +Syr2Tester::Syr2Tester() + : _A_is_complex(std::is_same>::value || + std::is_same>::value), + _A_is_lr(std::is_same::value), + _A_is_ll(std::is_same::value), + _testIsGpu(KokkosKernels::Impl::kk_is_gpu_exec_space< + typename Device::execution_space>()) +#ifdef KOKKOSKERNELS_ENABLE_TPL_BLAS + , + _vanillaUsesDifferentOrderOfOps(_A_is_lr) +#else + , + _vanillaUsesDifferentOrderOfOps(false) +#endif + , + // **************************************************************** + // Tolerances for double can be tighter than tolerances for float. + // + // In the case of calculations with float, a small number of + // discrepancies between reference results and CUDA results are + // large enough to require 'relTol' to be set to 5.0e-3. The same + // calculations show no discrepancies when performed with double. + // **************************************************************** + _absTol(std::is_same<_AuxType, float>::value + ? 1.0e-6 + : (std::is_same<_AuxType, double>::value ? 1.0e-9 : 0)), + _relTol(std::is_same<_AuxType, float>::value + ? 5.0e-3 + : (std::is_same<_AuxType, double>::value ?
1.0e-6 : 0)), + _M(-1), + _N(-1), + _useAnalyticalResults(false), + _useHermitianOption(false), + _useUpOption(false), + _kkSyr2ShouldThrowException(false), + _kkGerShouldThrowException(false) { +} + +template +Syr2Tester::~Syr2Tester() { + // Nothing to do +} + +template +void Syr2Tester::test(const int N, const int nonConstConstCombinations, + const bool useAnalyticalResults, + const bool useHermitianOption, + const bool useUpOption) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Entering Syr2Tester::test()... - - - - - - - - - - - - - - - - " + "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " + "- - - - - - - - - " + << std::endl; + + std::cout << "_A_is_complex = " << _A_is_complex + << ", _A_is_lr = " << _A_is_lr << ", _A_is_ll = " << _A_is_ll + << ", _testIsGpu = " << _testIsGpu + << ", _vanillaUsesDifferentOrderOfOps = " + << _vanillaUsesDifferentOrderOfOps << ", _absTol = " << _absTol + << ", _relTol = " << _relTol + << ", nonConstConstCombinations = " << nonConstConstCombinations + << ", useAnalyticalResults = " << useAnalyticalResults + << ", useHermitianOption = " << useHermitianOption + << ", useUpOption = " << useUpOption << std::endl; +#endif + // ******************************************************************** + // Step 1 of 7: declare main types and variables + // ******************************************************************** + _M = N; + _N = N; + _useAnalyticalResults = useAnalyticalResults; + _useHermitianOption = useHermitianOption; + _useUpOption = useUpOption; + +#ifdef KOKKOSKERNELS_ENABLE_TPL_BLAS + _kkSyr2ShouldThrowException = false; + + _kkGerShouldThrowException = false; + if (_A_is_complex && _useHermitianOption) { + _kkGerShouldThrowException = !_A_is_ll; + } +#endif + + bool test_x(false); + bool test_cx(false); + if (nonConstConstCombinations == 0) { + test_x = true; + } else if (nonConstConstCombinations == 1) { + test_cx = true; + } else { + test_x = true; + test_cx = true; + } + + 
view_stride_adapter<_ViewTypeX, false> x("X", _M); + view_stride_adapter<_ViewTypeY, false> y("Y", _N); + view_stride_adapter<_ViewTypeA, false> A("A", _M, _N); + + view_stride_adapter<_ViewTypeExpected, true> h_expected( + "expected A += alpha * x * x^{t,h}", _M, _N); + bool expectedResultIsKnown = false; + + using AlphaCoeffType = typename _ViewTypeA::non_const_value_type; + ScalarA alpha(Kokkos::ArithTraits::zero()); + + // ******************************************************************** + // Step 2 of 7: populate alpha, h_x, h_y, h_A, h_expected, x, y, A + // ******************************************************************** + this->populateVariables(alpha, x, y, A, h_expected.d_view, + expectedResultIsKnown); + + // ******************************************************************** + // Step 3 of 7: populate h_vanilla + // ******************************************************************** + view_stride_adapter<_ViewTypeExpected, true> h_vanilla( + "vanilla = A + alpha * x * x^{t,h}", _M, _N); +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2.hpp, computing vanilla A with alpha type = " + << typeid(alpha).name() << std::endl; +#endif + this->populateVanillaValues(alpha, x.h_view, y.h_view, A.h_view, + h_vanilla.d_view); + + // ******************************************************************** + // Step 4 of 7: use h_vanilla and h_expected as appropriate + // ******************************************************************** + if (expectedResultIsKnown) { + // ****************************************************************** + // Compare h_vanilla against h_expected + // ****************************************************************** + this->compareVanillaAgainstExpected(alpha, h_vanilla.d_view, + h_expected.d_view); + } else { + // ****************************************************************** + // Copy h_vanilla to h_expected + // ****************************************************************** +
Kokkos::deep_copy(h_expected.d_base, h_vanilla.d_base); + } + + // ******************************************************************** + // Step 5 of 7: test with 'non const x' + // ******************************************************************** + view_stride_adapter<_ViewTypeA, false> org_A("Org_A", _M, _N); + Kokkos::deep_copy(org_A.d_base, A.d_base); + Kokkos::deep_copy(org_A.h_view, A.h_view); + + if (test_x) { + this->callKkSyr2AndCompareAgainstExpected(alpha, x.d_view, y.d_view, A, + h_expected.d_view, "non const x"); + + if ((_useAnalyticalResults == false) && // Just to save run time + (_kkGerShouldThrowException == false)) { + this->callKkGerAndCompareKkSyr2AgainstIt(alpha, x.d_view, y.d_view, org_A, + A.h_view, "non const x"); + } + } + + // ******************************************************************** + // Step 6 of 7: test with const x + // ******************************************************************** + if (test_cx) { + Kokkos::deep_copy(A.d_base, org_A.d_base); + + this->callKkSyr2AndCompareAgainstExpected( + alpha, x.d_view_const, y.d_view_const, A, h_expected.d_view, "const x"); + } + + // ******************************************************************** + // Step 7 of 7: tests with invalid values on the first input parameter + // ******************************************************************** + EXPECT_ANY_THROW( + KokkosBlas::syr2(".", "U", alpha, x.d_view, y.d_view, A.d_view)) + << "Failed test: kk syr2 should have thrown an exception for mode '.'"; + EXPECT_ANY_THROW( + KokkosBlas::syr2("", "U", alpha, x.d_view, y.d_view, A.d_view)) + << "Failed test: kk syr2 should have thrown an exception for mode ''"; + EXPECT_ANY_THROW( + KokkosBlas::syr2("T", ".", alpha, x.d_view, y.d_view, A.d_view)) + << "Failed test: kk syr2 should have thrown an exception for uplo '.'"; + EXPECT_ANY_THROW( + KokkosBlas::syr2("T", "", alpha, x.d_view, y.d_view, A.d_view)) + << "Failed test: kk syr2 should have thrown an exception for uplo ''"; + 
+#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Leaving Syr2Tester::test() - - - - - - - - - - - - - - - - - - " + "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " + "- - - - - - - " + << std::endl; +#endif +} + +template +void Syr2Tester< + ScalarX, tLayoutX, ScalarY, tLayoutY, ScalarA, tLayoutA, + Device>::populateVariables(ScalarA& alpha, + view_stride_adapter<_ViewTypeX, false>& x, + view_stride_adapter<_ViewTypeY, false>& y, + view_stride_adapter<_ViewTypeA, false>& A, + _ViewTypeExpected& h_expected, + bool& expectedResultIsKnown) { + expectedResultIsKnown = false; + + if (_useAnalyticalResults) { + this->populateAnalyticalValues(alpha, x.h_view, y.h_view, A.h_view, + h_expected); + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(y.d_base, y.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); + + expectedResultIsKnown = true; + } else if (_N == 1) { + alpha = 3; + + x.h_view[0] = 2; + + y.h_view[0] = 4; + + A.h_view(0, 0) = 7; + + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(y.d_base, y.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); + + h_expected(0, 0) = 55; + expectedResultIsKnown = true; + } else if (_N == 2) { + alpha = 3; + + x.h_view[0] = -2; + x.h_view[1] = 9; + + y.h_view[0] = 5; + y.h_view[1] = -4; + + A.h_view(0, 0) = 17; + A.h_view(0, 1) = -43; + A.h_view(1, 0) = -43; + A.h_view(1, 1) = 101; + + Kokkos::deep_copy(x.d_base, x.h_base); + Kokkos::deep_copy(y.d_base, y.h_base); + Kokkos::deep_copy(A.d_base, A.h_base); + + if (_useUpOption) { + h_expected(0, 0) = -43; + h_expected(0, 1) = 116; + h_expected(1, 0) = -43; + h_expected(1, 1) = -115; + } else { + h_expected(0, 0) = -43; + h_expected(0, 1) = -43; + h_expected(1, 0) = 116; + h_expected(1, 1) = -115; + } + expectedResultIsKnown = true; + } else { + alpha = 3; + + Kokkos::Random_XorShift64_Pool rand_pool( + 13718); + + { + ScalarX randStart, randEnd; + Test::getRandomBounds(1.0, randStart, randEnd); + Kokkos::fill_random(x.d_view, rand_pool, randStart, 
randEnd); + } + + { + ScalarY randStart, randEnd; + Test::getRandomBounds(1.0, randStart, randEnd); + Kokkos::fill_random(y.d_view, rand_pool, randStart, randEnd); + } + + { + ScalarA randStart, randEnd; + Test::getRandomBounds(1.0, randStart, randEnd); + Kokkos::fill_random(A.d_view, rand_pool, randStart, randEnd); + } + + Kokkos::deep_copy(x.h_base, x.d_base); + Kokkos::deep_copy(y.h_base, y.d_base); + Kokkos::deep_copy(A.h_base, A.d_base); + + if (_useHermitianOption && _A_is_complex) { + // **************************************************************** + // Make h_A Hermitian + // **************************************************************** + for (int i(0); i < _N; ++i) { + for (int j(i + 1); j < _N; ++j) { + A.h_view(i, j) = _KAT_A::conj(A.h_view(j, i)); + } + } + + for (int i(0); i < _N; ++i) { + A.h_view(i, i) = 0.5 * (A.h_view(i, i) + _KAT_A::conj(A.h_view(i, i))); + } + } else { + // **************************************************************** + // Make h_A symmetric + // **************************************************************** + for (int i(0); i < _N; ++i) { + for (int j(i + 1); j < _N; ++j) { + A.h_view(i, j) = A.h_view(j, i); + } + } + } + Kokkos::deep_copy(A.d_base, A.h_base); + } + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (_N <= 2) { + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + std::cout << "h_origA(" << i << "," << j << ") = " << A.h_view(i, j) + << std::endl; + } + } + } +#endif +} + +// Code for complex values +template +template +typename std::enable_if>::value || + std::is_same>::value, + void>::type +Syr2Tester::populateAnalyticalValues(T& alpha, _HostViewTypeX& h_x, + _HostViewTypeY& h_y, + _HostViewTypeA& h_A, + _ViewTypeExpected& h_expected) { + alpha.real() = 1.4; + alpha.imag() = -2.3; + + for (int i = 0; i < _M; ++i) { + _AuxType auxI = this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i)); + h_x[i].real() = sin(auxI); + h_x[i].imag() = sin(auxI); + } + + for (int i = 0; i < _M; ++i) { + _AuxType 
auxI = this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i)); + h_y[i].real() = cos(auxI); + h_y[i].imag() = cos(auxI); + } + + if (_useHermitianOption) { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + _AuxType auxIpJ = + this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i + j)); + _AuxType auxImJ = + this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i - j)); + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + h_A(i, j).real() = sin(auxIpJ); + h_A(i, j).imag() = -sin(auxImJ); + } else { + h_A(i, j).real() = sin(auxIpJ); + h_A(i, j).imag() = sin(auxImJ); + } + } + } + } else { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + _AuxType auxIpJ = + this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i + j)); + h_A(i, j).real() = sin(auxIpJ); + h_A(i, j).imag() = sin(auxIpJ); + } + } + } + + if (_useHermitianOption) { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + _AuxType auxIpJ = + this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i + j)); + _AuxType auxImJ = + this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i - j)); + h_expected(i, j).real() = 3.8 * sin(auxIpJ); + h_expected(i, j).imag() = -5.6 * sin(auxImJ); + } else { + h_expected(i, j).real() = h_A(i, j).real(); + h_expected(i, j).imag() = h_A(i, j).imag(); + } + } + } + } else { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + _AuxType auxIpJ = + this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i + j)); + h_expected(i, j).real() = 5.6 * sin(auxIpJ); + h_expected(i, j).imag() = 3.8 * sin(auxIpJ); + } else { + h_expected(i, j).real() = h_A(i, j).real(); + h_expected(i, j).imag() = h_A(i, j).imag(); + } + } + } + } +} + +// Code for non-complex values +template +template +typename 
std::enable_if>::value && + !std::is_same>::value, + void>::type +Syr2Tester::populateAnalyticalValues(T& alpha, _HostViewTypeX& h_x, + _HostViewTypeY& h_y, + _HostViewTypeA& h_A, + _ViewTypeExpected& h_expected) { + alpha = std::is_same<_AuxType, int>::value ? 1 : 1.1; + + for (int i = 0; i < _M; ++i) { + _AuxType auxI = this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i)); + h_x[i] = sin(auxI); + } + + for (int i = 0; i < _M; ++i) { + _AuxType auxI = this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i)); + h_y[i] = cos(auxI); + } + + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + _AuxType auxIpJ = + this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i + j)); + h_A(i, j) = .1 * sin(auxIpJ); + } + } + + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + _AuxType auxIpJ = + this->shrinkAngleToZeroTwoPiRange(static_cast<_AuxType>(i + j)); + h_expected(i, j) = 1.2 * sin(auxIpJ); + } else { + h_expected(i, j) = h_A(i, j); + } + } + } +} + +// Code for complex values +template +template +typename std::enable_if>::value || + std::is_same>::value, + void>::type +Syr2Tester::populateVanillaValues(const T& alpha, + const _HostViewTypeX& h_x, + const _HostViewTypeY& h_y, + const _HostViewTypeA& h_A, + _ViewTypeExpected& h_vanilla) { + if (_vanillaUsesDifferentOrderOfOps) { + if (_useHermitianOption) { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + h_vanilla(i, j) = + h_A(i, j) + alpha * _KAT_A::conj(h_y(j)) * h_x(i) + + _KAT_A::conj(alpha) * _KAT_A::conj(h_x(j)) * h_y(i); + } else { + h_vanilla(i, j) = h_A(i, j); + } + } + } + for (int i = 0; i < _N; ++i) { + h_vanilla(i, i).imag() = 0.; + } + } else { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption 
== false) && (i >= j))) { + h_vanilla(i, j) = + h_A(i, j) + alpha * h_x(j) * h_y(i) + alpha * h_y(j) * h_x(i); + } else { + h_vanilla(i, j) = h_A(i, j); + } + } + } + } + } else { + if (_useHermitianOption) { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + h_vanilla(i, j) = + h_A(i, j) + alpha * h_x(i) * _KAT_A::conj(h_y(j)) + + _KAT_A::conj(alpha) * h_y(i) * _KAT_A::conj(h_x(j)); + } else { + h_vanilla(i, j) = h_A(i, j); + } + } + } + for (int i = 0; i < _N; ++i) { + h_vanilla(i, i).imag() = 0.; + } + } else { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + h_vanilla(i, j) = + h_A(i, j) + alpha * h_x(i) * h_y(j) + alpha * h_y(i) * h_x(j); + } else { + h_vanilla(i, j) = h_A(i, j); + } + } + } + } + } +} + +// Code for non-complex values +template +template +typename std::enable_if>::value && + !std::is_same>::value, + void>::type +Syr2Tester::populateVanillaValues(const T& alpha, + const _HostViewTypeX& h_x, + const _HostViewTypeY& h_y, + const _HostViewTypeA& h_A, + _ViewTypeExpected& h_vanilla) { + if (_useHermitianOption) { + if (_vanillaUsesDifferentOrderOfOps) { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + h_vanilla(i, j) = + h_A(i, j) + alpha * h_x(j) * _KAT_A::conj(h_y(i)) + + _KAT_A::conj(alpha) * h_y(j) * _KAT_A::conj(h_x(i)); + } else { + h_vanilla(i, j) = h_A(i, j); + } + } + } + } else { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + h_vanilla(i, j) = + h_A(i, j) + alpha * h_x(i) * _KAT_A::conj(h_y(j)) + + _KAT_A::conj(alpha) * h_y(i) * _KAT_A::conj(h_x(j)); + } else { + h_vanilla(i, j) = h_A(i, j); + } + } + } + } + } 
else { + if (_vanillaUsesDifferentOrderOfOps) { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + h_vanilla(i, j) = + h_A(i, j) + alpha * h_x(j) * h_y(i) + alpha * h_y(j) * h_x(i); + } else { + h_vanilla(i, j) = h_A(i, j); + } + } + } + } else { + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + h_vanilla(i, j) = + h_A(i, j) + alpha * h_x(i) * h_y(j) + alpha * h_y(i) * h_x(j); + } else { + h_vanilla(i, j) = h_A(i, j); + } + } + } + } + } +} + +template +template +T Syr2Tester::shrinkAngleToZeroTwoPiRange(const T input) { + T output(input); +#if 0 + T twoPi( 2. * Kokkos::numbers::pi ); + if (input > 0.) { + output -= std::floor( input / twoPi ) * twoPi; + } + else if (input < 0.) { + output += std::floor( -input / twoPi ) * twoPi; + } +#endif + return output; +} + +// Code for complex values +template +template +typename std::enable_if>::value || + std::is_same>::value, + void>::type +Syr2Tester:: + compareVanillaAgainstExpected(const T& alpha, + const _ViewTypeExpected& h_vanilla, + const _ViewTypeExpected& h_expected) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (_N <= 2) { + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + std::cout << "h_exp(" << i << "," << j << ") = " << h_expected(i, j) + << ", h_van(" << i << "," << j << ") = " << h_vanilla(i, j) + << std::endl; + } + } + } +#endif + int maxNumErrorsAllowed(static_cast(_M) * static_cast(_N) * + 1.e-3); + + if (_useAnalyticalResults) { + int numErrorsRealAbs(0); + int numErrorsRealRel(0); + int numErrorsImagAbs(0); + int numErrorsImagRel(0); + _AuxType diff(0.); + _AuxType diffThreshold(0.); + bool errorHappened(false); + _AuxType maxErrorRealRel(0.); + int iForMaxErrorRealRel(0); + int jForMaxErrorRealRel(0); + _AuxType maxErrorImagRel(0.); + int iForMaxErrorImagRel(0); + int 
jForMaxErrorImagRel(0); + + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + diff = _KAT_A::abs(h_expected(i, j).real() - h_vanilla(i, j).real()); + errorHappened = false; + if (h_expected(i, j).real() == 0.) { + diffThreshold = _KAT_A::abs(_absTol); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsRealAbs++; + } + } else { + _AuxType aux = diff / _KAT_A::abs(h_expected(i, j).real()); + if (maxErrorRealRel < aux) { + maxErrorRealRel = aux; + iForMaxErrorRealRel = i; + jForMaxErrorRealRel = j; + } + + diffThreshold = _KAT_A::abs(_relTol * h_expected(i, j).real()); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsRealRel++; + } + } + if (errorHappened && (numErrorsRealAbs + numErrorsRealRel == 1)) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "ERROR, i = " << i << ", j = " << j + << ": h_expected(i,j).real() = " << h_expected(i, j).real() + << ", h_vanilla(i,j).real() = " << h_vanilla(i, j).real() + << ", _KAT_A::abs(h_expected(i,j).real() - " + "h_vanilla(i,j).real()) = " + << diff << ", diffThreshold = " << diffThreshold + << std::endl; +#endif + } + diff = _KAT_A::abs(h_expected(i, j).imag() - h_vanilla(i, j).imag()); + errorHappened = false; + if (h_expected(i, j).imag() == 0.) 
{ + diffThreshold = _KAT_A::abs(_absTol); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsImagAbs++; + } + } else { + _AuxType aux = diff / _KAT_A::abs(h_expected(i, j).imag()); + if (maxErrorImagRel < aux) { + maxErrorImagRel = aux; + iForMaxErrorImagRel = i; + jForMaxErrorImagRel = j; + } + + diffThreshold = _KAT_A::abs(_relTol * h_expected(i, j).imag()); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsImagRel++; + } + } + if (errorHappened && (numErrorsImagAbs + numErrorsImagRel == 1)) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "ERROR, i = " << i << ", j = " << j + << ": h_expected(i,j).imag() = " << h_expected(i, j).imag() + << ", h_vanilla(i,j).imag() = " << h_vanilla(i, j).imag() + << ", _KAT_A::abs(h_expected(i,j).imag() - " + "h_vanilla(i,j).imag()) = " + << diff << ", diffThreshold = " << diffThreshold + << std::endl; +#endif + } + } // for j + } // for i + + { + std::ostringstream msg; + msg << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ": vanilla differs too much from analytical on real components" + << ", numErrorsRealAbs = " << numErrorsRealAbs + << ", numErrorsRealRel = " << numErrorsRealRel + << ", maxErrorRealRel = " << maxErrorRealRel + << ", iForMaxErrorRealRel = " << iForMaxErrorRealRel + << ", jForMaxErrorRealRel = " << jForMaxErrorRealRel + << ", h_expected(i,j).real() = " + << (((_M > 0) && (_N > 0)) + ? h_expected(iForMaxErrorRealRel, jForMaxErrorRealRel).real() + : 9.999e+99) + << ", h_vanilla(i,j).real() = " + << (((_M > 0) && (_N > 0)) + ? 
h_vanilla(iForMaxErrorRealRel, jForMaxErrorRealRel).real() + : 9.999e+99) + << ", maxNumErrorsAllowed = " << maxNumErrorsAllowed; + + int numErrorsReal(numErrorsRealAbs + numErrorsRealRel); +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrorsReal > 0) { + std::cout << "WARNING" << msg.str() << std::endl; + } +#endif + EXPECT_LE(numErrorsReal, maxNumErrorsAllowed) + << "Failed test" << msg.str(); + } + { + std::ostringstream msg; + msg << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ": vanilla differs too much from analytical on imag components" + << ", numErrorsImagAbs = " << numErrorsImagAbs + << ", numErrorsImagRel = " << numErrorsImagRel + << ", maxErrorImagRel = " << maxErrorImagRel + << ", iForMaxErrorImagRel = " << iForMaxErrorImagRel + << ", jForMaxErrorImagRel = " << jForMaxErrorImagRel + << ", h_expected(i,j).imag() = " + << (((_M > 0) && (_N > 0)) + ? h_expected(iForMaxErrorImagRel, jForMaxErrorImagRel).imag() + : 9.999e+99) + << ", h_vanilla(i,j).imag() = " + << (((_M > 0) && (_N > 0)) + ? 
h_vanilla(iForMaxErrorImagRel, jForMaxErrorImagRel).imag() + : 9.999e+99) + << ", maxNumErrorsAllowed = " << maxNumErrorsAllowed; + + int numErrorsImag(numErrorsImagAbs + numErrorsImagRel); +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrorsImag > 0) { + std::cout << "WARNING" << msg.str() << std::endl; + } +#endif + EXPECT_LE(numErrorsImag, maxNumErrorsAllowed) + << "Failed test" << msg.str(); + } + } else { + int numErrorsReal(0); + int numErrorsImag(0); + + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + if (h_expected(i, j).real() != h_vanilla(i, j).real()) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrorsReal == 0) { + std::cout << "ERROR, i = " << i << ", j = " << j + << ": h_expected(i,j).real() = " + << h_expected(i, j).real() + << ", h_vanilla(i,j).real() = " << h_vanilla(i, j).real() + << std::endl; + } +#endif + numErrorsReal++; + } + + if (h_expected(i, j).imag() != h_vanilla(i, j).imag()) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrorsImag == 0) { + std::cout << "ERROR, i = " << i << ", j = " << j + << ": h_expected(i,j).imag() = " + << h_expected(i, j).imag() + << ", h_vanilla(i,j).imag() = " << h_vanilla(i, j).imag() + << std::endl; + } +#endif + numErrorsImag++; + } + } // for j + } // for i + EXPECT_EQ(numErrorsReal, 0) + << "Failed test" + << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ": vanilla result is incorrect on real components" + << ", numErrorsReal = " << numErrorsReal; + EXPECT_EQ(numErrorsImag, 0) + << "Failed test" + << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ": vanilla result is incorrect on imag components" + << ", 
numErrorsImag = " << numErrorsImag; + } +} + +// Code for non-complex values +template +template +typename std::enable_if>::value && + !std::is_same>::value, + void>::type +Syr2Tester:: + compareVanillaAgainstExpected(const T& alpha, + const _ViewTypeExpected& h_vanilla, + const _ViewTypeExpected& h_expected) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (_N <= 2) { + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + std::cout << "h_exp(" << i << "," << j << ") = " << h_expected(i, j) + << ", h_van(" << i << "," << j << ") = " << h_vanilla(i, j) + << std::endl; + } + } + } +#endif + int maxNumErrorsAllowed(static_cast(_M) * static_cast(_N) * + 1.e-3); + + if (_useAnalyticalResults) { + int numErrorsAbs(0); + int numErrorsRel(0); + _AuxType diff(0.); + _AuxType diffThreshold(0.); + bool errorHappened(false); + _AuxType maxErrorRel(0.); + int iForMaxErrorRel(0); + int jForMaxErrorRel(0); + + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + diff = _KAT_A::abs(h_expected(i, j) - h_vanilla(i, j)); + errorHappened = false; + if (h_expected(i, j) == 0.) 
{ + diffThreshold = _KAT_A::abs(_absTol); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsAbs++; + } + } else { + _AuxType aux = diff / _KAT_A::abs(h_expected(i, j)); + if (maxErrorRel < aux) { + maxErrorRel = aux; + iForMaxErrorRel = i; + jForMaxErrorRel = j; + } + + diffThreshold = _KAT_A::abs(_relTol * h_expected(i, j)); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsRel++; + } + } + if (errorHappened && (numErrorsAbs + numErrorsRel == 1)) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "ERROR, i = " << i << ", j = " << j + << ": h_expected(i,j) = " << h_expected(i, j) + << ", h_vanilla(i,j) = " << h_vanilla(i, j) + << ", _KAT_A::abs(h_expected(i,j) - h_vanilla(i,j)) = " + << diff << ", diffThreshold = " << diffThreshold + << std::endl; +#endif + } + } // for j + } // for i + + { + std::ostringstream msg; + msg << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ": vanilla differs too much from expected" + << ", numErrorsAbs = " << numErrorsAbs + << ", numErrorsRel = " << numErrorsRel + << ", maxErrorRel = " << maxErrorRel + << ", iForMaxErrorRel = " << iForMaxErrorRel + << ", jForMaxErrorRel = " << jForMaxErrorRel << ", h_expected(i,j) = " + << (((_M > 0) && (_N > 0)) + ? h_expected(iForMaxErrorRel, jForMaxErrorRel) + : 9.999e+99) + << ", h_vanilla(i,j) = " + << (((_M > 0) && (_N > 0)) + ? 
h_vanilla(iForMaxErrorRel, jForMaxErrorRel) + : 9.999e+99) + << ", maxNumErrorsAllowed = " << maxNumErrorsAllowed; + + int numErrors(numErrorsAbs + numErrorsRel); +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrors > 0) { + std::cout << "WARNING" << msg.str() << std::endl; + } +#endif + EXPECT_LE(numErrors, maxNumErrorsAllowed) << "Failed test" << msg.str(); + } + } else { + int numErrors(0); + + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + if (h_expected(i, j) != h_vanilla(i, j)) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrors == 0) { + std::cout << "ERROR, i = " << i << ", j = " << j + << ": h_expected(i,j) = " << h_expected(i, j) + << ", h_vanilla(i,j) = " << h_vanilla(i, j) << std::endl; + } +#endif + numErrors++; + } + } // for j + } // for i + EXPECT_EQ(numErrors, 0) + << "Failed test" + << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ": vanilla result is incorrect" + << ", numErrors = " << numErrors; + } +} + +// Code for complex values +template +template +typename std::enable_if>::value || + std::is_same>::value, + void>::type +Syr2Tester:: + compareKkSyr2AgainstReference(const T& alpha, const _HostViewTypeA& h_A, + const _ViewTypeExpected& h_reference) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (_N <= 2) { + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + std::cout << "h_exp(" << i << "," << j << ") = " << h_reference(i, j) + << ", h_A(" << i << "," << j << ") = " << h_A(i, j) + << std::endl; + } + } + } +#endif + int maxNumErrorsAllowed(static_cast(_M) * static_cast(_N) * + 1.e-3); + + int numErrorsRealAbs(0); + int numErrorsRealRel(0); + int numErrorsImagAbs(0); + int numErrorsImagRel(0); + _AuxType diff(0.); + _AuxType diffThreshold(0.); + bool errorHappened(false); + _AuxType maxErrorRealRel(0.); + int 
iForMaxErrorRealRel(0); + int jForMaxErrorRealRel(0); + _AuxType maxErrorImagRel(0.); + int iForMaxErrorImagRel(0); + int jForMaxErrorImagRel(0); + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + diff = _KAT_A::abs(h_reference(i, j).real() - h_A(i, j).real()); + errorHappened = false; + if (h_reference(i, j).real() == 0.) { + diffThreshold = _KAT_A::abs(_absTol); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsRealAbs++; + } + } else { + _AuxType aux = diff / _KAT_A::abs(h_reference(i, j).real()); + if (maxErrorRealRel < aux) { + maxErrorRealRel = aux; + iForMaxErrorRealRel = i; + jForMaxErrorRealRel = j; + } + + diffThreshold = _KAT_A::abs(_relTol * h_reference(i, j).real()); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsRealRel++; + } + } + if (errorHappened && (numErrorsRealAbs + numErrorsRealRel == 1)) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout + << "ERROR, i = " << i << ", j = " << j + << ": h_reference(i,j).real() = " << h_reference(i, j).real() + << ", h_A(i,j).real() = " << h_A(i, j).real() + << ", _KAT_A::abs(h_reference(i,j).real() - h_A(i,j).real()) = " + << diff << ", diffThreshold = " << diffThreshold << std::endl; +#endif + } + diff = _KAT_A::abs(h_reference(i, j).imag() - h_A(i, j).imag()); + errorHappened = false; + if (h_reference(i, j).imag() == 0.) 
{ + diffThreshold = _KAT_A::abs(_absTol); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsImagAbs++; + } + } else { + _AuxType aux = diff / _KAT_A::abs(h_reference(i, j).imag()); + if (maxErrorImagRel < aux) { + maxErrorImagRel = aux; + iForMaxErrorImagRel = i; + jForMaxErrorImagRel = j; + } + + diffThreshold = _KAT_A::abs(_relTol * h_reference(i, j).imag()); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsImagRel++; + } + } + if (errorHappened && (numErrorsImagAbs + numErrorsImagRel == 1)) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout + << "ERROR, i = " << i << ", j = " << j + << ": h_reference(i,j).imag() = " << h_reference(i, j).imag() + << ", h_A(i,j).imag() = " << h_A(i, j).imag() + << ", _KAT_A::abs(h_reference(i,j).imag() - h_A(i,j).imag()) = " + << diff << ", diffThreshold = " << diffThreshold << std::endl; +#endif + } + } // for j + } // for i + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout + << "A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ", numErrorsRealAbs = " << numErrorsRealAbs + << ", numErrorsRealRel = " << numErrorsRealRel + << ", maxErrorRealRel = " << maxErrorRealRel + << ", iForMaxErrorRealRel = " << iForMaxErrorRealRel + << ", jForMaxErrorRealRel = " << jForMaxErrorRealRel + << ", h_reference(i,j).real() = " + << (((_M > 0) && (_N > 0)) + ? h_reference(iForMaxErrorRealRel, jForMaxErrorRealRel).real() + : 9.999e+99) + << ", h_A(i,j).real() = " + << (((_M > 0) && (_N > 0)) + ? 
h_A(iForMaxErrorRealRel, jForMaxErrorRealRel).real() + : 9.999e+99) + << ", numErrorsImagAbs = " << numErrorsImagAbs + << ", numErrorsImagRel = " << numErrorsImagRel + << ", maxErrorImagRel = " << maxErrorImagRel + << ", iForMaxErrorImagRel = " << iForMaxErrorImagRel + << ", jForMaxErrorImagRel = " << jForMaxErrorImagRel + << ", h_reference(i,j).imag() = " + << (((_M > 0) && (_N > 0)) + ? h_reference(iForMaxErrorImagRel, jForMaxErrorImagRel).imag() + : 9.999e+99) + << ", h_A(i,j).imag() = " + << (((_M > 0) && (_N > 0)) + ? h_A(iForMaxErrorImagRel, jForMaxErrorImagRel).imag() + : 9.999e+99) + << ", maxNumErrorsAllowed = " << maxNumErrorsAllowed << std::endl; + if ((_M == 2131) && (_N == 2131)) { + std::cout << "Information" + << ": A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ", h_reference(11, 2119) = (" << h_reference(11, 2119).real() + << ", " << h_reference(11, 2119).imag() << ")" + << ", h_A(11, 2119) = (" << h_A(11, 2119).real() << ", " + << h_A(11, 2119).imag() << ")" << std::endl; + std::cout << "Information" + << ": A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ", h_reference(710, 1065) = (" << h_reference(710, 1065).real() + << ", " << h_reference(710, 1065).imag() << ")" + << ", h_A(710, 1065) = (" << h_A(710, 1065).real() << ", " + << h_A(710, 1065).imag() << ")" << std::endl; + } +#endif + { + std::ostringstream msg; + msg << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ": 
syr2 result is incorrect on real components" + << ", numErrorsRealAbs = " << numErrorsRealAbs + << ", numErrorsRealRel = " << numErrorsRealRel + << ", maxErrorRealRel = " << maxErrorRealRel + << ", iForMaxErrorRealRel = " << iForMaxErrorRealRel + << ", jForMaxErrorRealRel = " << jForMaxErrorRealRel + << ", h_reference(i,j).real() = " + << (((_M > 0) && (_N > 0)) + ? h_reference(iForMaxErrorRealRel, jForMaxErrorRealRel).real() + : 9.999e+99) + << ", h_A(i,j).real() = " + << (((_M > 0) && (_N > 0)) + ? h_A(iForMaxErrorRealRel, jForMaxErrorRealRel).real() + : 9.999e+99) + << ", maxNumErrorsAllowed = " << maxNumErrorsAllowed; + + int numErrorsReal(numErrorsRealAbs + numErrorsRealRel); +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrorsReal > 0) { + std::cout << "WARNING" << msg.str() << std::endl; + } +#endif + EXPECT_LE(numErrorsReal, maxNumErrorsAllowed) << "Failed test" << msg.str(); + } + { + std::ostringstream msg; + msg << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ": syr2 result is incorrect on imag components" + << ", numErrorsImagAbs = " << numErrorsImagAbs + << ", numErrorsImagRel = " << numErrorsImagRel + << ", maxErrorImagRel = " << maxErrorImagRel + << ", iForMaxErrorImagRel = " << iForMaxErrorImagRel + << ", jForMaxErrorImagRel = " << jForMaxErrorImagRel + << ", h_reference(i,j).imag() = " + << (((_M > 0) && (_N > 0)) + ? h_reference(iForMaxErrorImagRel, jForMaxErrorImagRel).imag() + : 9.999e+99) + << ", h_A(i,j).imag() = " + << (((_M > 0) && (_N > 0)) + ? 
h_A(iForMaxErrorImagRel, jForMaxErrorImagRel).imag() + : 9.999e+99) + << ", maxNumErrorsAllowed = " << maxNumErrorsAllowed; + + int numErrorsImag(numErrorsImagAbs + numErrorsImagRel); +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrorsImag > 0) { + std::cout << "WARNING" << msg.str() << std::endl; + } +#endif + EXPECT_LE(numErrorsImag, maxNumErrorsAllowed) << "Failed test" << msg.str(); + } +} + +// Code for non-complex values +template +template +typename std::enable_if>::value && + !std::is_same>::value, + void>::type +Syr2Tester:: + compareKkSyr2AgainstReference(const T& alpha, const _HostViewTypeA& h_A, + const _ViewTypeExpected& h_reference) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (_N <= 2) { + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + std::cout << "h_exp(" << i << "," << j << ") = " << h_reference(i, j) + << ", h_A(" << i << "," << j << ") = " << h_A(i, j) + << std::endl; + } + } + } +#endif + int maxNumErrorsAllowed(static_cast(_M) * static_cast(_N) * + 1.e-3); + + int numErrorsAbs(0); + int numErrorsRel(0); + _AuxType diff(0.); + _AuxType diffThreshold(0.); + bool errorHappened(false); + _AuxType maxErrorRel(0.); + int iForMaxErrorRel(0); + int jForMaxErrorRel(0); + for (int i(0); i < _M; ++i) { + for (int j(0); j < _N; ++j) { + diff = _KAT_A::abs(h_reference(i, j) - h_A(i, j)); + errorHappened = false; + if (h_reference(i, j) == 0.) 
{ + diffThreshold = _KAT_A::abs(_absTol); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsAbs++; + } + } else { + _AuxType aux = diff / _KAT_A::abs(h_reference(i, j)); + if (maxErrorRel < aux) { + maxErrorRel = aux; + iForMaxErrorRel = i; + jForMaxErrorRel = j; + } + + diffThreshold = _KAT_A::abs(_relTol * h_reference(i, j)); + if (diff > diffThreshold) { + errorHappened = true; + numErrorsRel++; + } + } + if (errorHappened && (numErrorsAbs + numErrorsRel == 1)) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "ERROR, i = " << i << ", j = " << j + << ": h_reference(i,j) = " << h_reference(i, j) + << ", h_A(i,j) = " << h_A(i, j) + << ", _KAT_A::abs(h_reference(i,j) - h_A(i,j)) = " << diff + << ", diffThreshold = " << diffThreshold << std::endl; +#endif + } + } // for j + } // for i +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption + << ", numErrorsAbs = " << numErrorsAbs + << ", numErrorsRel = " << numErrorsRel + << ", maxErrorRel = " << maxErrorRel + << ", iForMaxErrorRel = " << iForMaxErrorRel + << ", jForMaxErrorRel = " << jForMaxErrorRel + << ", h_reference(i,j) = " + << (((_M > 0) && (_N > 0)) + ? h_reference(iForMaxErrorRel, jForMaxErrorRel) + : 9.999e+99) + << ", h_A(i,j) = " + << (((_M > 0) && (_N > 0)) ? 
h_A(iForMaxErrorRel, jForMaxErrorRel) + : 9.999e+99) + << ", maxNumErrorsAllowed = " << maxNumErrorsAllowed << std::endl; +#endif + { + std::ostringstream msg; + msg << ", A is " << _M << " by " << _N << ", _A_is_lr = " << _A_is_lr + << ", _A_is_ll = " << _A_is_ll + << ", alpha type = " << typeid(alpha).name() + << ", _useHermitianOption = " << _useHermitianOption + << ", _useUpOption = " << _useUpOption << ": syr2 result is incorrect" + << ", numErrorsAbs = " << numErrorsAbs + << ", numErrorsRel = " << numErrorsRel + << ", maxErrorRel = " << maxErrorRel + << ", iForMaxErrorRel = " << iForMaxErrorRel + << ", jForMaxErrorRel = " << jForMaxErrorRel << ", h_reference(i,j) = " + << (((_M > 0) && (_N > 0)) + ? h_reference(iForMaxErrorRel, jForMaxErrorRel) + : 9.999e+99) + << ", h_A(i,j) = " + << (((_M > 0) && (_N > 0)) ? h_A(iForMaxErrorRel, jForMaxErrorRel) + : 9.999e+99) + << ", maxNumErrorsAllowed = " << maxNumErrorsAllowed; + + int numErrors(numErrorsAbs + numErrorsRel); +#ifdef HAVE_KOKKOSKERNELS_DEBUG + if (numErrors > 0) { + std::cout << "WARNING" << msg.str() << std::endl; + } +#endif + EXPECT_LE(numErrors, maxNumErrorsAllowed) << "Failed test" << msg.str(); + } +} + +template +template +void Syr2Tester:: + callKkSyr2AndCompareAgainstExpected( + const ScalarA& alpha, TX& x, TY& y, + view_stride_adapter<_ViewTypeA, false>& A, + const _ViewTypeExpected& h_expected, const std::string& situation) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2, '" << situation << "', alpha = " << alpha + << std::endl; + std::cout << "In Test_Blas2_syr2.hpp, right before calling KokkosBlas::syr2()" + << ": ViewTypeA = " << typeid(_ViewTypeA).name() + << ", _kkSyr2ShouldThrowException = " << _kkSyr2ShouldThrowException + << std::endl; +#endif + std::string mode = _useHermitianOption ? "H" : "T"; + std::string uplo = _useUpOption ? 
"U" : "L"; + bool gotStdException(false); + bool gotUnknownException(false); + try { + KokkosBlas::syr2(mode.c_str(), uplo.c_str(), alpha, x, y, A.d_view); + Kokkos::fence(); + } catch (const std::exception& e) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2, '" << situation + << "': caught exception, e.what() = " << e.what() << std::endl; +#endif + gotStdException = true; + } catch (...) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2, '" << situation + << "': caught unknown exception" << std::endl; +#endif + gotUnknownException = true; + } + + EXPECT_EQ(gotUnknownException, false) + << "Failed test, '" << situation + << "': unknown exception should not have happened"; + + EXPECT_EQ(gotStdException, _kkSyr2ShouldThrowException) + << "Failed test, '" << situation << "': kk syr2() should" + << (_kkSyr2ShouldThrowException ? " " : " not ") + << "have thrown a std::exception"; + + if ((gotStdException == false) && (gotUnknownException == false)) { + Kokkos::deep_copy(A.h_base, A.d_base); + this->compareKkSyr2AgainstReference(alpha, A.h_view, h_expected); + } +} + +template +template +void Syr2Tester:: + callKkGerAndCompareKkSyr2AgainstIt( + const ScalarA& alpha, TX& x, TY& y, + view_stride_adapter<_ViewTypeA, false>& org_A, + const _HostViewTypeA& h_A_syr2, const std::string& situation) { + view_stride_adapter<_ViewTypeA, false> A_ger("A_ger", _M, _N); + Kokkos::deep_copy(A_ger.d_base, org_A.d_base); + + // ******************************************************************** + // Call ger() + // ******************************************************************** +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2, '" << situation << "', alpha = " << alpha + << std::endl; + std::cout << "In Test_Blas2_syr2.hpp, right before calling KokkosBlas::ger()" + << ": ViewTypeA = " << typeid(_ViewTypeA).name() + << ", _kkGerShouldThrowException = " << _kkGerShouldThrowException + << std::endl; +#endif + std::string 
mode = _useHermitianOption ? "H" : "T"; + bool gotStdException(false); + bool gotUnknownException(false); + try { + KokkosBlas::ger(mode.c_str(), alpha, x, y, A_ger.d_view); + Kokkos::fence(); + } catch (const std::exception& e) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2, '" << situation + << "', ger() call 1: caught exception, e.what() = " << e.what() + << std::endl; +#endif + gotStdException = true; + } catch (...) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2, '" << situation + << "', ger() call 1: caught unknown exception" << std::endl; +#endif + gotUnknownException = true; + } + + EXPECT_EQ(gotUnknownException, false) + << "Failed test, '" << situation + << "': unknown exception should not have happened for ger() call 1"; + + EXPECT_EQ(gotStdException, false) + << "Failed test, '" << situation + << "': kk ger() 1 should not have thrown a std::exception"; + + // ******************************************************************** + // Call ger() again + // ******************************************************************** +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout + << "In Test_Blas2_syr2.hpp, right before calling KokkosBlas::ger() again"; +#endif + try { + if (_useHermitianOption) { + KokkosBlas::ger(mode.c_str(), _KAT_A::conj(alpha), y, x, A_ger.d_view); + } else { + KokkosBlas::ger(mode.c_str(), alpha, y, x, A_ger.d_view); + } + Kokkos::fence(); + } catch (const std::exception& e) { +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2, '" << situation + << "', ger() call 2: caught exception, e.what() = " << e.what() + << std::endl; +#endif + gotStdException = true; + } catch (...) 
{ +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "In Test_Blas2_syr2, '" << situation + << "', ger() call 2: caught unknown exception" << std::endl; +#endif + gotUnknownException = true; + } + + EXPECT_EQ(gotUnknownException, false) + << "Failed test, '" << situation + << "': unknown exception should not have happened for ger() call 2"; + + EXPECT_EQ(gotStdException, false) + << "Failed test, '" << situation + << "': kk ger() 2 should not have thrown a std::exception"; + + // ******************************************************************** + // Prepare h_ger_reference to be compared against h_A_syr2 + // ******************************************************************** + view_stride_adapter<_ViewTypeExpected, true> h_ger_reference( + "h_ger_reference", _M, _N); + Kokkos::deep_copy(h_ger_reference.d_base, A_ger.d_base); + Kokkos::deep_copy(h_ger_reference.h_base, h_ger_reference.d_base); + + std::string uplo = _useUpOption ? "U" : "L"; + for (int i = 0; i < _M; ++i) { + for (int j = 0; j < _N; ++j) { + if (((_useUpOption == true) && (i <= j)) || + ((_useUpOption == false) && (i >= j))) { + // Keep h_ger_reference as already computed + } else { + h_ger_reference.h_view(i, j) = org_A.h_view(i, j); + } + } + } + if (_useHermitianOption && _A_is_complex) { + for (int i(0); i < _N; ++i) { + h_ger_reference.h_view(i, i) = + 0.5 * (h_ger_reference.h_view(i, i) + + _KAT_A::conj(h_ger_reference.h_view(i, i))); + } + } + + // ******************************************************************** + // Compare + // ******************************************************************** + this->compareKkSyr2AgainstReference(alpha, h_A_syr2, h_ger_reference.h_view); +} + +} // namespace Test + +template +#ifdef HAVE_KOKKOSKERNELS_DEBUG +int test_syr2(const std::string& caseName) { + std::cout << "+==============================================================" + "============" + << std::endl; + std::cout << "Starting " << caseName << "..." 
<< std::endl; +#else +int test_syr2(const std::string& /*caseName*/) { +#endif + bool xBool = std::is_same::value || + std::is_same::value || + std::is_same>::value || + std::is_same>::value; + bool yBool = std::is_same::value || + std::is_same::value || + std::is_same>::value || + std::is_same>::value; + bool aBool = std::is_same::value || + std::is_same::value || + std::is_same>::value || + std::is_same>::value; + bool useAnalyticalResults = xBool && yBool && aBool; + +#if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "+--------------------------------------------------------------" + "------------" + << std::endl; + std::cout << "Starting " << caseName << " for LAYOUTLEFT ..." << std::endl; +#endif + if (true) { + Test::Syr2Tester + tester; + tester.test(0, 0); + tester.test(1, 0); + tester.test(2, 0); + tester.test(13, 0); + tester.test(1024, 0); + + if (useAnalyticalResults) { + tester.test(1024, 0, true, false, false); + tester.test(1024, 0, true, false, true); + tester.test(1024, 0, true, true, false); + tester.test(1024, 0, true, true, true); + } + + tester.test(2, 0, false, false, true); + tester.test(50, 0, false, false, true); + tester.test(2, 0, false, true, false); + tester.test(50, 0, false, true, false); + tester.test(2, 0, false, true, true); + tester.test(50, 0, false, true, true); + + tester.test(50, 4); + tester.test(2131, 0); + } + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Finished " << caseName << " for LAYOUTLEFT" << std::endl; + std::cout << "+--------------------------------------------------------------" + "------------" + << std::endl; +#endif +#endif + +#if defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << 
"+--------------------------------------------------------------" + "------------" + << std::endl; + std::cout << "Starting " << caseName << " for LAYOUTRIGHT ..." << std::endl; +#endif + if (true) { + Test::Syr2Tester + tester; + tester.test(0, 0); + tester.test(1, 0); + tester.test(2, 0); + tester.test(13, 0); + tester.test(1024, 0); + + if (useAnalyticalResults) { + tester.test(1024, 0, true, false, false); + tester.test(1024, 0, true, false, true); + tester.test(1024, 0, true, true, false); + tester.test(1024, 0, true, true, true); + } + + tester.test(2, 0, false, false, true); + tester.test(50, 0, false, false, true); + tester.test(2, 0, false, true, false); + tester.test(50, 0, false, true, false); + tester.test(2, 0, false, true, true); + tester.test(50, 0, false, true, true); + + tester.test(50, 4); + tester.test(2131, 0); + } + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Finished " << caseName << " for LAYOUTRIGHT" << std::endl; + std::cout << "+--------------------------------------------------------------" + "------------" + << std::endl; +#endif +#endif + +#if defined(KOKKOSKERNELS_INST_LAYOUTSTRIDE) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "+--------------------------------------------------------------" + "------------" + << std::endl; + std::cout << "Starting " << caseName << " for LAYOUTSTRIDE ..." 
<< std::endl; +#endif + if (true) { + Test::Syr2Tester + tester; + tester.test(0, 0); + tester.test(1, 0); + tester.test(2, 0); + tester.test(13, 0); + tester.test(1024, 0); + + if (useAnalyticalResults) { + tester.test(1024, 0, true, false, false); + tester.test(1024, 0, true, false, true); + tester.test(1024, 0, true, true, false); + tester.test(1024, 0, true, true, true); + } + + tester.test(2, 0, false, false, true); + tester.test(50, 0, false, false, true); + tester.test(2, 0, false, true, false); + tester.test(50, 0, false, true, false); + tester.test(2, 0, false, true, true); + tester.test(50, 0, false, true, true); + + tester.test(50, 4); + tester.test(2131, 0); + } + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Finished " << caseName << " for LAYOUTSTRIDE" << std::endl; + std::cout << "+--------------------------------------------------------------" + "------------" + << std::endl; +#endif +#endif + +#if !defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS) +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "+--------------------------------------------------------------" + "------------" + << std::endl; + std::cout << "Starting " << caseName << " for MIXED LAYOUTS ..." 
<< std::endl; +#endif + if (true) { + Test::Syr2Tester + tester; + tester.test(1, 0); + tester.test(2, 0); + tester.test(1024, 0); + + if (useAnalyticalResults) { + tester.test(1024, 0, true, false, true); + tester.test(1024, 0, true, true, true); + } + + tester.test(2, 0, false, false, true); + tester.test(50, 0, false, false, true); + tester.test(2, 0, false, true, true); + tester.test(50, 0, false, true, true); + } + + if (true) { + Test::Syr2Tester + tester; + tester.test(1024, 0); + } + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Finished " << caseName << " for MIXED LAYOUTS" << std::endl; + std::cout << "+--------------------------------------------------------------" + "------------" + << std::endl; +#endif +#endif + +#ifdef HAVE_KOKKOSKERNELS_DEBUG + std::cout << "Finished " << caseName << std::endl; + std::cout << "+==============================================================" + "============" + << std::endl; +#endif + return 1; +} + +#if defined(KOKKOSKERNELS_INST_FLOAT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, syr2_float) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::syr2_float"); + test_syr2("test case syr2_float"); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_COMPLEX_FLOAT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, syr2_complex_float) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::syr2_complex_float"); + test_syr2, Kokkos::complex, + Kokkos::complex, TestDevice>("test case syr2_complex_float"); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_DOUBLE) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, syr2_double) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::syr2_double"); + test_syr2("test case syr2_double"); + Kokkos::Profiling::popRegion(); +} 
+#endif + +#if defined(KOKKOSKERNELS_INST_COMPLEX_DOUBLE) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, syr2_complex_double) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::syr2_complex_double"); + test_syr2, Kokkos::complex, + Kokkos::complex, TestDevice>( + "test case syr2_complex_double"); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_INT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, syr2_int) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::syr2_int"); + test_syr2("test case syr2_int"); + Kokkos::Profiling::popRegion(); +} +#endif + +#if !defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS) +TEST_F(TestCategory, syr2_int_float_double) { + Kokkos::Profiling::pushRegion("KokkosBlas::Test::syr2_int_float_double"); + test_syr2("test case syr2_mixed_types"); + Kokkos::Profiling::popRegion(); +} +#endif diff --git a/cm_generate_makefile.bash b/cm_generate_makefile.bash index 3358ae2eb8..e872789c72 100755 --- a/cm_generate_makefile.bash +++ b/cm_generate_makefile.bash @@ -178,6 +178,7 @@ get_kernels_tpls_list() { KOKKOSKERNELS_USER_TPL_LIBNAME_CMD= CUBLAS_DEFAULT=OFF CUSPARSE_DEFAULT=OFF + CUSOLVER_DEFAULT=OFF ROCBLAS_DEFAULT=OFF ROCSPARSE_DEFAULT=OFF PARSE_TPLS_LIST=$(echo $KOKKOSKERNELS_TPLS | tr "," "\n") @@ -191,6 +192,9 @@ get_kernels_tpls_list() { if [ "$UC_TPLS" == "CUSPARSE" ]; then CUSPARSE_DEFAULT=ON fi + if [ "$UC_TPLS" == "CUSOLVER" ]; then + CUSOLVER_DEFAULT=ON + fi if [ "$UC_TPLS" == "ROCBLAS" ]; then ROCBLAS_DEFAULT=ON fi @@ -224,6 +228,9 @@ get_kernels_tpls_list() { if [ "$CUSPARSE_DEFAULT" == "OFF" ]; then KOKKOSKERNELS_TPLS_CMD="-DKokkosKernels_ENABLE_TPL_CUSPARSE=OFF ${KOKKOSKERNELS_TPLS_CMD}" fi + if [ "$CUSOLVER_DEFAULT" == "OFF" ]; then + KOKKOSKERNELS_TPLS_CMD="-DKokkosKernels_ENABLE_TPL_CUSOLVER=OFF ${KOKKOSKERNELS_TPLS_CMD}" + fi if 
[ "$ROCBLAS_DEFAULT" == "OFF" ]; then KOKKOSKERNELS_TPLS_CMD="-DKokkosKernels_ENABLE_TPL_ROCBLAS=OFF ${KOKKOSKERNELS_TPLS_CMD}" fi @@ -320,7 +327,6 @@ display_help_text() { echo "--with-gtest=/Path/To/Gtest: Set path to gtest. (Used in unit and performance" echo " tests.)" echo "--with-hwloc=/Path/To/Hwloc: Set path to hwloc library." - echo "--with-memkind=/Path/To/MemKind: Set path to memkind library." echo "--with-options=[OPT]: Additional options to Kokkos:" echo " compiler_warnings" echo " aggressive_vectorization = add ivdep on loops" @@ -487,10 +493,6 @@ do KOKKOS_HWLOC=ON HWLOC_PATH="${key#*=}" ;; - --with-memkind*) - KOKKOS_MEMKIND=ON - MEMKIND_PATH="${key#*=}" - ;; --arch*) KOKKOS_ARCH="${key#*=}" ;; @@ -710,15 +712,6 @@ else KOKKOS_HWLOC_CMD= fi -if [ "$KOKKOS_MEMKIND" == "ON" ]; then - KOKKOS_MEMKIND_CMD=-DKokkos_ENABLE_MEMKIND=ON - if [ "$MEMKIND_PATH" != "" ]; then - KOKKOS_MEMKIND_PATH_CMD=-DMEMKIND_ROOT=$MEMKIND_PATH - fi -else - KOKKOS_MEMKIND_CMD= -fi - # Currently assumes script is in base kokkos-kernels directory if [ ! 
-e ${KOKKOSKERNELS_PATH}/CMakeLists.txt ]; then @@ -811,9 +804,9 @@ cd ${KOKKOS_INSTALL_PATH} # Configure kokkos echo "" -echo cmake $COMPILER_CMD -DCMAKE_CXX_FLAGS="${KOKKOS_CXXFLAGS}" -DCMAKE_EXE_LINKER_FLAGS="${KOKKOS_LDFLAGS}" -DCMAKE_INSTALL_PREFIX=${KOKKOS_INSTALL_PATH} ${KOKKOS_DEVICE_CMD} ${KOKKOS_ARCH_CMD} -DKokkos_ENABLE_TESTS=${KOKKOS_DO_TESTS} -DKokkos_ENABLE_EXAMPLES=${KOKKOS_DO_EXAMPLES} ${KOKKOS_OPTION_CMD} ${KOKKOS_CUDA_OPTION_CMD} ${KOKKOS_HIP_OPTION_CMD} -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_EXTENSIONS=OFF ${STANDARD_CMD} ${KOKKOS_BUILDTYPE_CMD} -DBUILD_SHARED_LIBS=${BUILD_SHARED_LIBRARIES} ${KOKKOS_BC_CMD} ${KOKKOS_HWLOC_CMD} ${KOKKOS_HWLOC_PATH_CMD} ${KOKKOS_MEMKIND_CMD} ${KOKKOS_MEMKIND_PATH_CMD} -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF -DKokkos_ENABLE_DEPRECATED_CODE_4=${KOKKOS_DEPRECATED_CODE} -DKokkos_ENABLE_DEPRECATION_WARNINGS=${KOKKOS_DEPRECATED_CODE_WARNINGS} ${KOKKOS_PASSTHRU_CMAKE_FLAGS} ${KOKKOS_PATH} +echo cmake $COMPILER_CMD -DCMAKE_CXX_FLAGS="${KOKKOS_CXXFLAGS}" -DCMAKE_EXE_LINKER_FLAGS="${KOKKOS_LDFLAGS}" -DCMAKE_INSTALL_PREFIX=${KOKKOS_INSTALL_PATH} ${KOKKOS_DEVICE_CMD} ${KOKKOS_ARCH_CMD} -DKokkos_ENABLE_TESTS=${KOKKOS_DO_TESTS} -DKokkos_ENABLE_EXAMPLES=${KOKKOS_DO_EXAMPLES} ${KOKKOS_OPTION_CMD} ${KOKKOS_CUDA_OPTION_CMD} ${KOKKOS_HIP_OPTION_CMD} -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_EXTENSIONS=OFF ${STANDARD_CMD} ${KOKKOS_BUILDTYPE_CMD} -DBUILD_SHARED_LIBS=${BUILD_SHARED_LIBRARIES} ${KOKKOS_BC_CMD} ${KOKKOS_HWLOC_CMD} ${KOKKOS_HWLOC_PATH_CMD} -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF -DKokkos_ENABLE_DEPRECATED_CODE_4=${KOKKOS_DEPRECATED_CODE} -DKokkos_ENABLE_DEPRECATION_WARNINGS=${KOKKOS_DEPRECATED_CODE_WARNINGS} ${KOKKOS_PASSTHRU_CMAKE_FLAGS} ${KOKKOS_PATH} echo "" -cmake $COMPILER_CMD -DCMAKE_CXX_FLAGS="${KOKKOS_CXXFLAGS//\"}" -DCMAKE_EXE_LINKER_FLAGS="${KOKKOS_LDFLAGS//\"}" -DCMAKE_INSTALL_PREFIX=${KOKKOS_INSTALL_PATH} ${KOKKOS_DEVICE_CMD} ${KOKKOS_ARCH_CMD} -DKokkos_ENABLE_TESTS=${KOKKOS_DO_TESTS} 
-DKokkos_ENABLE_EXAMPLES=${KOKKOS_DO_EXAMPLES} ${KOKKOS_OPTION_CMD} ${KOKKOS_CUDA_OPTION_CMD} ${KOKKOS_HIP_OPTION_CMD} -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_EXTENSIONS=OFF ${STANDARD_CMD} ${KOKKOS_BUILDTYPE_CMD} -DBUILD_SHARED_LIBS=${BUILD_SHARED_LIBRARIES} ${KOKKOS_BC_CMD} ${KOKKOS_HWLOC_CMD} ${KOKKOS_HWLOC_PATH_CMD} ${KOKKOS_MEMKIND_CMD} ${KOKKOS_MEMKIND_PATH_CMD} -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF -DKokkos_ENABLE_DEPRECATED_CODE_4=${KOKKOS_DEPRECATED_CODE} -DKokkos_ENABLE_DEPRECATION_WARNINGS=${KOKKOS_DEPRECATED_CODE_WARNINGS} ${KOKKOS_PASSTHRU_CMAKE_FLAGS} ${KOKKOS_PATH} +cmake $COMPILER_CMD -DCMAKE_CXX_FLAGS="${KOKKOS_CXXFLAGS//\"}" -DCMAKE_EXE_LINKER_FLAGS="${KOKKOS_LDFLAGS//\"}" -DCMAKE_INSTALL_PREFIX=${KOKKOS_INSTALL_PATH} ${KOKKOS_DEVICE_CMD} ${KOKKOS_ARCH_CMD} -DKokkos_ENABLE_TESTS=${KOKKOS_DO_TESTS} -DKokkos_ENABLE_EXAMPLES=${KOKKOS_DO_EXAMPLES} ${KOKKOS_OPTION_CMD} ${KOKKOS_CUDA_OPTION_CMD} ${KOKKOS_HIP_OPTION_CMD} -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_EXTENSIONS=OFF ${STANDARD_CMD} ${KOKKOS_BUILDTYPE_CMD} -DBUILD_SHARED_LIBS=${BUILD_SHARED_LIBRARIES} ${KOKKOS_BC_CMD} ${KOKKOS_HWLOC_CMD} ${KOKKOS_HWLOC_PATH_CMD} -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF -DKokkos_ENABLE_DEPRECATED_CODE_4=${KOKKOS_DEPRECATED_CODE} -DKokkos_ENABLE_DEPRECATION_WARNINGS=${KOKKOS_DEPRECATED_CODE_WARNINGS} ${KOKKOS_PASSTHRU_CMAKE_FLAGS} ${KOKKOS_PATH} # Install kokkos library make install -j $KOKKOS_MAKEINSTALL_J diff --git a/cmake/Dependencies.cmake b/cmake/Dependencies.cmake index 777d4445b3..a52f0c098c 100644 --- a/cmake/Dependencies.cmake +++ b/cmake/Dependencies.cmake @@ -1,7 +1,7 @@ TRIBITS_PACKAGE_DEFINE_DEPENDENCIES( LIB_REQUIRED_PACKAGES Kokkos - LIB_OPTIONAL_TPLS quadmath MKL BLAS LAPACK CUSPARSE METIS SuperLU Cholmod CUBLAS ROCBLAS ROCSPARSE - TEST_OPTIONAL_TPLS yaml-cpp + LIB_OPTIONAL_TPLS quadmath MKL BLAS LAPACK METIS SuperLU Cholmod CUBLAS CUSPARSE CUSOLVER ROCBLAS ROCSPARSE ROCSOLVER + TEST_OPTIONAL_TPLS yamlcpp ) # NOTE: If you update names in 
LIB_OPTIONAL_TPLS above, make sure to map those names in # the macro 'KOKKOSKERNELS_ADD_TPL_OPTION' that resides in cmake/kokkoskernels_tpls.cmake. diff --git a/cmake/KokkosKernels_config.h.in b/cmake/KokkosKernels_config.h.in index d94860e380..ef8fea78b8 100644 --- a/cmake/KokkosKernels_config.h.in +++ b/cmake/KokkosKernels_config.h.in @@ -53,6 +53,7 @@ /* Whether to build kernels for execution space Kokkos::HIP */ #cmakedefine KOKKOSKERNELS_INST_EXECSPACE_HIP #cmakedefine KOKKOSKERNELS_INST_MEMSPACE_HIPSPACE +#cmakedefine KOKKOSKERNELS_INST_MEMSPACE_HIPMANAGEDSPACE /* Whether to build kernels for execution space Kokkos::Experimental::SYCL */ #cmakedefine KOKKOSKERNELS_INST_EXECSPACE_SYCL #cmakedefine KOKKOSKERNELS_INST_MEMSPACE_SYCLSPACE @@ -114,10 +115,12 @@ #cmakedefine KOKKOSKERNELS_ENABLE_TPL_LAPACK /* MKL library */ #cmakedefine KOKKOSKERNELS_ENABLE_TPL_MKL -/* CUSPARSE */ -#cmakedefine KOKKOSKERNELS_ENABLE_TPL_CUSPARSE /* CUBLAS */ #cmakedefine KOKKOSKERNELS_ENABLE_TPL_CUBLAS +/* CUSPARSE */ +#cmakedefine KOKKOSKERNELS_ENABLE_TPL_CUSPARSE +/* CUSOLVER */ +#cmakedefine KOKKOSKERNELS_ENABLE_TPL_CUSOLVER /* MAGMA */ #cmakedefine KOKKOSKERNELS_ENABLE_TPL_MAGMA /* SuperLU */ @@ -138,6 +141,8 @@ #cmakedefine KOKKOSKERNELS_ENABLE_TPL_ROCBLAS /* ROCSPARSE */ #cmakedefine KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE +/* ROCSOLVER */ +#cmakedefine KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER #cmakedefine KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV diff --git a/cmake/Modules/FindTPLCUBLAS.cmake b/cmake/Modules/FindTPLCUBLAS.cmake index 890c2dac62..164f3bf4c4 100644 --- a/cmake/Modules/FindTPLCUBLAS.cmake +++ b/cmake/Modules/FindTPLCUBLAS.cmake @@ -1,18 +1,47 @@ -FIND_PACKAGE(CUDA) - -INCLUDE(FindPackageHandleStandardArgs) -IF (NOT CUDA_FOUND) - #Important note here: this find Module is named TPLCUBLAS - #The eventual target is named CUBLAS. To avoid naming conflicts - #the find module is called TPLCUBLAS. 
This call will cause - #the find_package call to fail in a "standard" CMake way - FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUBLAS REQUIRED_VARS CUDA_FOUND) -ELSE() - #The libraries might be empty - OR they might explicitly be not found - IF("${CUDA_CUBLAS_LIBRARIES}" MATCHES "NOTFOUND") - FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUBLAS REQUIRED_VARS CUDA_CUBLAS_LIBRARIES) +if(CUBLAS_LIBRARIES AND CUBLAS_LIBRARY_DIRS AND CUBLAS_INCLUDE_DIRS) + kokkoskernels_find_imported(CUBLAS INTERFACE + LIBRARIES ${CUBLAS_LIBRARIES} + LIBRARY_PATHS ${CUBLAS_LIBRARY_DIRS} + HEADER_PATHS ${CUBLAS_INCLUDE_DIRS} + ) +elseif(CUBLAS_LIBRARIES AND CUBLAS_LIBRARY_DIRS) + kokkoskernels_find_imported(CUBLAS INTERFACE + LIBRARIES ${CUBLAS_LIBRARIES} + LIBRARY_PATHS ${CUBLAS_LIBRARY_DIRS} + HEADER cublas.h + ) +elseif(CUBLAS_LIBRARIES) + kokkoskernels_find_imported(CUBLAS INTERFACE + LIBRARIES ${CUBLAS_LIBRARIES} + HEADER cublas.h + ) +elseif(CUBLAS_LIBRARY_DIRS) + kokkoskernels_find_imported(CUBLAS INTERFACE + LIBRARIES cublas + LIBRARY_PATHS ${CUBLAS_LIBRARY_DIRS} + HEADER cublas.h + ) +elseif(CUBLAS_ROOT OR KokkosKernels_CUBLAS_ROOT) # nothing specific provided, just ROOT + kokkoskernels_find_imported(CUBLAS INTERFACE + LIBRARIES cublas + HEADER cublas.h + ) +else() # backwards-compatible way + FIND_PACKAGE(CUDA) + INCLUDE(FindPackageHandleStandardArgs) + IF (NOT CUDA_FOUND) + #Important note here: this find Module is named TPLCUBLAS + #The eventual target is named CUBLAS. To avoid naming conflicts + #the find module is called TPLCUBLAS. 
This call will cause + #the find_package call to fail in a "standard" CMake way + FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUBLAS REQUIRED_VARS CUDA_FOUND) ELSE() - KOKKOSKERNELS_CREATE_IMPORTED_TPL(CUBLAS INTERFACE - LINK_LIBRARIES "${CUDA_CUBLAS_LIBRARIES}") + #The libraries might be empty - OR they might explicitly be not found + IF("${CUDA_CUBLAS_LIBRARIES}" MATCHES "NOTFOUND") + FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUBLAS REQUIRED_VARS CUDA_CUBLAS_LIBRARIES) + ELSE() + KOKKOSKERNELS_CREATE_IMPORTED_TPL(CUBLAS INTERFACE + LINK_LIBRARIES "${CUDA_CUBLAS_LIBRARIES}") + ENDIF() ENDIF() -ENDIF() +endif() diff --git a/cmake/Modules/FindTPLCUSOLVER.cmake b/cmake/Modules/FindTPLCUSOLVER.cmake new file mode 100644 index 0000000000..3e43639495 --- /dev/null +++ b/cmake/Modules/FindTPLCUSOLVER.cmake @@ -0,0 +1,46 @@ +if(CUSOLVER_LIBRARIES AND CUSOLVER_LIBRARY_DIRS AND CUSOLVER_INCLUDE_DIRS) + kokkoskernels_find_imported(CUSOLVER INTERFACE + LIBRARIES ${CUSOLVER_LIBRARIES} + LIBRARY_PATHS ${CUSOLVER_LIBRARY_DIRS} + HEADER_PATHS ${CUSOLVER_INCLUDE_DIRS} + ) +elseif(CUSOLVER_LIBRARIES AND CUSOLVER_LIBRARY_DIRS) + kokkoskernels_find_imported(CUSOLVER INTERFACE + LIBRARIES ${CUSOLVER_LIBRARIES} + LIBRARY_PATHS ${CUSOLVER_LIBRARY_DIRS} + HEADER cusolverDn.h + ) +elseif(CUSOLVER_LIBRARIES) + kokkoskernels_find_imported(CUSOLVER INTERFACE + LIBRARIES ${CUSOLVER_LIBRARIES} + HEADER cusolverDn.h + ) +elseif(CUSOLVER_LIBRARY_DIRS) + kokkoskernels_find_imported(CUSOLVER INTERFACE + LIBRARIES cusolver + LIBRARY_PATHS ${CUSOLVER_LIBRARY_DIRS} + HEADER cusolverDn.h + ) +elseif(CUSOLVER_ROOT OR KokkosKernels_CUSOLVER_ROOT) # nothing specific provided, just ROOT + kokkoskernels_find_imported(CUSOLVER INTERFACE + LIBRARIES cusolver + HEADER cusolverDn.h + ) +else() # backwards-compatible way + FIND_PACKAGE(CUDA) + INCLUDE(FindPackageHandleStandardArgs) + IF (NOT CUDA_FOUND) + #Important note here: this find Module is named TPLCUSOLVER + #The eventual target is named CUSOLVER. 
To avoid naming conflicts + #the find module is called TPLCUSOLVER. This call will cause + #the find_package call to fail in a "standard" CMake way + FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUSOLVER REQUIRED_VARS CUDA_FOUND) + ELSE() + #The libraries might be empty - OR they might explicitly be not found + IF("${CUDA_cusolver_LIBRARY}" MATCHES "NOTFOUND") + FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUSOLVER REQUIRED_VARS CUDA_cusolver_LIBRARY) + ELSE() + KOKKOSKERNELS_CREATE_IMPORTED_TPL(CUSOLVER INTERFACE LINK_LIBRARIES "${CUDA_cusolver_LIBRARY}") + ENDIF() + ENDIF() +endif() diff --git a/cmake/Modules/FindTPLCUSPARSE.cmake b/cmake/Modules/FindTPLCUSPARSE.cmake index f6e02129ae..6302f85d78 100644 --- a/cmake/Modules/FindTPLCUSPARSE.cmake +++ b/cmake/Modules/FindTPLCUSPARSE.cmake @@ -1,17 +1,46 @@ -FIND_PACKAGE(CUDA) - -INCLUDE(FindPackageHandleStandardArgs) -IF (NOT CUDA_FOUND) - #Important note here: this find Module is named TPLCUSPARSE - #The eventual target is named CUSPARSE. To avoid naming conflicts - #the find module is called TPLCUSPARSE. 
This call will cause - #the find_package call to fail in a "standard" CMake way - FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUSPARSE REQUIRED_VARS CUDA_FOUND) -ELSE() - #The libraries might be empty - OR they might explicitly be not found - IF("${CUDA_cusparse_LIBRARY}" MATCHES "NOTFOUND") - FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUSPARSE REQUIRED_VARS CUDA_cusparse_LIBRARY) +if(CUSPARSE_LIBRARIES AND CUSPARSE_LIBRARY_DIRS AND CUSPARSE_INCLUDE_DIRS) + kokkoskernels_find_imported(CUSPARSE INTERFACE + LIBRARIES ${CUSPARSE_LIBRARIES} + LIBRARY_PATHS ${CUSPARSE_LIBRARY_DIRS} + HEADER_PATHS ${CUSPARSE_INCLUDE_DIRS} + ) +elseif(CUSPARSE_LIBRARIES AND CUSPARSE_LIBRARY_DIRS) + kokkoskernels_find_imported(CUSPARSE INTERFACE + LIBRARIES ${CUSPARSE_LIBRARIES} + LIBRARY_PATHS ${CUSPARSE_LIBRARY_DIRS} + HEADER cusparse.h + ) +elseif(CUSPARSE_LIBRARIES) + kokkoskernels_find_imported(CUSPARSE INTERFACE + LIBRARIES ${CUSPARSE_LIBRARIES} + HEADER cusparse.h + ) +elseif(CUSPARSE_LIBRARY_DIRS) + kokkoskernels_find_imported(CUSPARSE INTERFACE + LIBRARIES cusparse + LIBRARY_PATHS ${CUSPARSE_LIBRARY_DIRS} + HEADER cusparse.h + ) +elseif(CUSPARSE_ROOT OR KokkosKernels_CUSPARSE_ROOT) # nothing specific provided, just ROOT + kokkoskernels_find_imported(CUSPARSE INTERFACE + LIBRARIES cusparse + HEADER cusparse.h + ) +else() # backwards-compatible way + FIND_PACKAGE(CUDA) + INCLUDE(FindPackageHandleStandardArgs) + IF (NOT CUDA_FOUND) + #Important note here: this find Module is named TPLCUSPARSE + #The eventual target is named CUSPARSE. To avoid naming conflicts + #the find module is called TPLCUSPARSE. 
This call will cause + #the find_package call to fail in a "standard" CMake way + FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUSPARSE REQUIRED_VARS CUDA_FOUND) ELSE() - KOKKOSKERNELS_CREATE_IMPORTED_TPL(CUSPARSE LIBRARY ${CUDA_cusparse_LIBRARY}) + #The libraries might be empty - OR they might explicitly be not found + IF("${CUDA_cusparse_LIBRARY}" MATCHES "NOTFOUND") + FIND_PACKAGE_HANDLE_STANDARD_ARGS(TPLCUSPARSE REQUIRED_VARS CUDA_cusparse_LIBRARY) + ELSE() + KOKKOSKERNELS_CREATE_IMPORTED_TPL(CUSPARSE INTERFACE LINK_LIBRARIES "${CUDA_cusparse_LIBRARY}") + ENDIF() ENDIF() -ENDIF() +endif() diff --git a/cmake/Modules/FindTPLROCSOLVER.cmake b/cmake/Modules/FindTPLROCSOLVER.cmake new file mode 100644 index 0000000000..8f2a92cfda --- /dev/null +++ b/cmake/Modules/FindTPLROCSOLVER.cmake @@ -0,0 +1,9 @@ +# LBV: 11/08/2023: This file follows the pattern of FindTPLROCBLAS.cmake/FindTPLROCSPARSE.cmake +FIND_PACKAGE(ROCSOLVER) +if(TARGET roc::rocsolver) + SET(TPL_ROCSOLVER_IMPORTED_NAME roc::rocsolver) + SET(TPL_IMPORTED_NAME roc::rocsolver) + ADD_LIBRARY(KokkosKernels::ROCSOLVER ALIAS roc::rocsolver) +ELSE() + MESSAGE(FATAL_ERROR "Package ROCSOLVER requested but not found") +ENDIF() diff --git a/cmake/kokkoskernels_components.cmake b/cmake/kokkoskernels_components.cmake index 49bc2f4ae6..16a784bd1f 100644 --- a/cmake/kokkoskernels_components.cmake +++ b/cmake/kokkoskernels_components.cmake @@ -102,4 +102,4 @@ IF ( KokkosKernels_ENABLE_COMPONENT_BATCHED ELSE() SET(KOKKOSKERNELS_ALL_COMPONENTS_ENABLED OFF CACHE BOOL "" FORCE) ENDIF() -mark_as_advanced(FORCE KOKKOSKERNELS_ALL_COMPONENTS_ENABLED) +mark_as_advanced(FORCE KOKKOSKERNELS_ALL_COMPONENTS_ENABLED) \ No newline at end of file diff --git a/cmake/kokkoskernels_eti_devices.cmake b/cmake/kokkoskernels_eti_devices.cmake index 8c6cb540ae..8c38be098c 100644 --- a/cmake/kokkoskernels_eti_devices.cmake +++ b/cmake/kokkoskernels_eti_devices.cmake @@ -23,20 +23,20 @@ SET(MEM_SPACES MEMSPACE_CUDASPACE MEMSPACE_CUDAUVMSPACE 
MEMSPACE_HIPSPACE + MEMSPACE_HIPMANAGEDSPACE MEMSPACE_SYCLSPACE MEMSPACE_SYCLSHAREDSPACE MEMSPACE_OPENMPTARGET MEMSPACE_HOSTSPACE - MEMSPACE_HBWSPACE ) SET(MEMSPACE_CUDASPACE_CPP_TYPE Kokkos::CudaSpace) SET(MEMSPACE_CUDAUVMSPACE_CPP_TYPE Kokkos::CudaUVMSpace) SET(MEMSPACE_HIPSPACE_CPP_TYPE Kokkos::HIPSpace) +SET(MEMSPACE_HIPMANAGEDSPACE_CPP_TYPE Kokkos::HIPManagedSpace) SET(MEMSPACE_SYCLSPACE_CPP_TYPE Kokkos::Experimental::SYCLDeviceUSMSpace) SET(MEMSPACE_SYCLSHAREDSPACE_CPP_TYPE Kokkos::Experimental::SYCLSharedUSMSpace) SET(MEMSPACE_OPENMPTARGETSPACE_CPP_TYPE Kokkos::Experimental::OpenMPTargetSpace) SET(MEMSPACE_HOSTSPACE_CPP_TYPE Kokkos::HostSpace) -SET(MEMSPACE_HBWSPACE_CPP_TYPE Kokkos::HBWSpace) IF(KOKKOS_ENABLE_CUDA) KOKKOSKERNELS_ADD_OPTION( @@ -85,10 +85,19 @@ IF(KOKKOS_ENABLE_HIP) BOOL "Whether to pre instantiate kernels for the memory space Kokkos::HIPSpace. Disabling this when Kokkos_ENABLE_HIP is enabled may increase build times. Default: ON if Kokkos is HIP-enabled, OFF otherwise." ) + KOKKOSKERNELS_ADD_OPTION( + INST_MEMSPACE_HIPMANAGEDSPACE + OFF + BOOL + "Whether to pre instantiate kernels for the memory space Kokkos::HIPManagedSpace. Disabling this when Kokkos_ENABLE_HIP is enabled may increase build times. Default: OFF." + ) IF(KOKKOSKERNELS_INST_EXECSPACE_HIP AND KOKKOSKERNELS_INST_MEMSPACE_HIPSPACE) LIST(APPEND DEVICE_LIST "") ENDIF() + IF(KOKKOSKERNELS_INST_EXECSPACE_HIP AND KOKKOSKERNELS_INST_MEMSPACE_HIPMANAGEDSPACE) + LIST(APPEND DEVICE_LIST "") + ENDIF() IF( Trilinos_ENABLE_COMPLEX_DOUBLE AND ((NOT DEFINED CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS) OR (NOT CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS)) ) MESSAGE( WARNING "The CMake option CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS is either undefined or OFF. Please set CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS:BOOL=ON when building with HIP and complex double enabled.") @@ -152,13 +161,6 @@ KOKKOSKERNELS_ADD_OPTION( "Whether to pre instantiate kernels for the memory space Kokkos::HostSpace. 
Disabling this when one of the Host execution spaces is enabled may increase build times. Default: ON" ) -KOKKOSKERNELS_ADD_OPTION( - INST_MEMSPACE_HBWSPACE - OFF - BOOL - "Whether to pre instantiate kernels for the memory space Kokkos::HBWSpace." -) - KOKKOSKERNELS_ADD_OPTION( INST_EXECSPACE_OPENMP ${KOKKOSKERNELS_INST_EXECSPACE_OPENMP_DEFAULT} @@ -197,12 +199,12 @@ KOKKOSKERNELS_ADD_OPTION( ) SET(EXECSPACE_CUDA_VALID_MEM_SPACES CUDASPACE CUDAUVMSPACE) -SET(EXECSPACE_HIP_VALID_MEM_SPACES HIPSPACE) +SET(EXECSPACE_HIP_VALID_MEM_SPACES HIPSPACE HIPMANAGEDSPACE) SET(EXECSPACE_SYCL_VALID_MEM_SPACES SYCLSPACE SYCLSHAREDSPACE) SET(EXECSPACE_OPENMPTARGET_VALID_MEM_SPACES OPENMPTARGETSPACE) -SET(EXECSPACE_SERIAL_VALID_MEM_SPACES HBWSPACE HOSTSPACE) -SET(EXECSPACE_OPENMP_VALID_MEM_SPACES HBWSPACE HOSTSPACE) -SET(EXECSPACE_THREADS_VALID_MEM_SPACES HBWSPACE HOSTSPACE) +SET(EXECSPACE_SERIAL_VALID_MEM_SPACES HOSTSPACE) +SET(EXECSPACE_OPENMP_VALID_MEM_SPACES HOSTSPACE) +SET(EXECSPACE_THREADS_VALID_MEM_SPACES HOSTSPACE) SET(DEVICES) FOREACH(EXEC ${EXEC_SPACES}) IF (KOKKOSKERNELS_INST_${EXEC}) diff --git a/cmake/kokkoskernels_features.cmake b/cmake/kokkoskernels_features.cmake index aacc1c8451..211c0c740e 100644 --- a/cmake/kokkoskernels_features.cmake +++ b/cmake/kokkoskernels_features.cmake @@ -27,3 +27,38 @@ IF (KOKKOSKERNELS_ENABLE_TPL_BLAS OR KOKKOSKERNELS_ENABLE_TPL_MKL OR KOKKOSKERNE INCLUDE(CheckHostBlasReturnComplex.cmake) CHECK_HOST_BLAS_RETURN_COMPLEX(KOKKOSKERNELS_TPL_BLAS_RETURN_COMPLEX) ENDIF() + +# ================================================================== +# Lapack requirements +# ================================================================== + +IF (KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_ROCBLAS AND NOT KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE) + MESSAGE(FATAL_ERROR "rocSOLVER requires rocBLAS and rocSPARSE, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_ROCBLAS:BOOL=ON and KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE:BOOL=ON.") +ELSEIF 
(KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE) + MESSAGE(FATAL_ERROR "rocSOLVER requires rocSPARSE, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE:BOOL=ON.") +ELSEIF (KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_ROCBLAS) + MESSAGE(FATAL_ERROR "rocSOLVER requires rocBLAS, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_ROCBLAS:BOOL=ON.") +ENDIF() + +# TPL_ENABLE_CUDA default enables CUBLAS and CUSOLVER in Trilinos, but not CUSPARSE. CUSPARSE is a required TPL for CUSOLVER support in KokkosKernels. +IF (KOKKOSKERNELS_HAS_TRILINOS AND TPL_ENABLE_CUDA) + # Checks disable CUSOLVER in KokkosKernels if TPL dependency requirements are not met. This is a compatibility workaround to allow existing configuration options for Trilinos to continue working. + IF (KOKKOSKERNELS_ENABLE_TPL_CUSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_CUBLAS AND NOT KOKKOSKERNELS_ENABLE_TPL_CUSPARSE) + MESSAGE(WARNING "cuSOLVER requires cuBLAS and cuSPARSE, disabling cuSOLVER. To use cuSOLVER, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_CUBLAS:BOOL=ON and KOKKOSKERNELS_ENABLE_TPL_CUSPARSE:BOOL=ON to use.") + SET(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER OFF CACHE BOOL "Disabling KOKKOSKERNELS_ENABLE_TPL_CUSOLVER - this capability requires both CUBLAS and CUSPARSE TPLs" FORCE) + ELSEIF (KOKKOSKERNELS_ENABLE_TPL_CUSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_CUSPARSE) + MESSAGE(WARNING "cuSOLVER requires cuSPARSE, disabling cuSOLVER. To use cuSOLVER, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_CUSPARSE:BOOL=ON to use.") + SET(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER OFF CACHE BOOL "Disabling KOKKOSKERNELS_ENABLE_TPL_CUSOLVER - this capability requires both CUBLAS and CUSPARSE TPLs" FORCE) + ELSEIF (KOKKOSKERNELS_ENABLE_TPL_CUSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_CUBLAS) + MESSAGE(WARNING "cuSOLVER requires cuBLAS, disabling cuSOLVER. 
To use cuSOLVER, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_CUBLAS:BOOL=ON to use.") + SET(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER OFF CACHE BOOL "Disabling KOKKOSKERNELS_ENABLE_TPL_CUSOLVER - this capability requires both CUBLAS and CUSPARSE TPLs" FORCE) + ENDIF() +ELSE() + IF (KOKKOSKERNELS_ENABLE_TPL_CUSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_CUBLAS AND NOT KOKKOSKERNELS_ENABLE_TPL_CUSPARSE) + MESSAGE(FATAL_ERROR "cuSOLVER requires cuBLAS and cuSPARSE, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_CUBLAS:BOOL=ON and KOKKOSKERNELS_ENABLE_TPL_CUSPARSE:BOOL=ON.") + ELSEIF (KOKKOSKERNELS_ENABLE_TPL_CUSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_CUSPARSE) + MESSAGE(FATAL_ERROR "cuSOLVER requires cuSPARSE, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_CUSPARSE:BOOL=ON.") + ELSEIF (KOKKOSKERNELS_ENABLE_TPL_CUSOLVER AND NOT KOKKOSKERNELS_ENABLE_TPL_CUBLAS) + MESSAGE(FATAL_ERROR "cuSOLVER requires cuBLAS, please reconfigure with KOKKOSKERNELS_ENABLE_TPL_CUBLAS:BOOL=ON.") + ENDIF() +ENDIF() diff --git a/cmake/kokkoskernels_tpls.cmake b/cmake/kokkoskernels_tpls.cmake index 08c7158148..d1a44721e6 100644 --- a/cmake/kokkoskernels_tpls.cmake +++ b/cmake/kokkoskernels_tpls.cmake @@ -447,28 +447,35 @@ ENDIF() KOKKOSKERNELS_ADD_OPTION(NO_DEFAULT_CUDA_TPLS OFF BOOL "Whether CUDA TPLs should be enabled by default. 
Default: OFF") SET(CUBLAS_DEFAULT ${KOKKOS_ENABLE_CUDA}) SET(CUSPARSE_DEFAULT ${KOKKOS_ENABLE_CUDA}) +SET(CUSOLVER_DEFAULT ${KOKKOS_ENABLE_CUDA}) IF(KOKKOSKERNELS_NO_DEFAULT_CUDA_TPLS) SET(CUBLAS_DEFAULT OFF) SET(CUSPARSE_DEFAULT OFF) + SET(CUSOLVER_DEFAULT OFF) ENDIF() KOKKOSKERNELS_ADD_TPL_OPTION(CUBLAS ${CUBLAS_DEFAULT} "Whether to enable CUBLAS" DEFAULT_DOCSTRING "ON if CUDA-enabled Kokkos, otherwise OFF") KOKKOSKERNELS_ADD_TPL_OPTION(CUSPARSE ${CUSPARSE_DEFAULT} "Whether to enable CUSPARSE" DEFAULT_DOCSTRING "ON if CUDA-enabled Kokkos, otherwise OFF") +KOKKOSKERNELS_ADD_TPL_OPTION(CUSOLVER ${CUSOLVER_DEFAULT} "Whether to enable CUSOLVER" + DEFAULT_DOCSTRING "ON if CUDA-enabled Kokkos, otherwise OFF") KOKKOSKERNELS_ADD_OPTION(NO_DEFAULT_ROCM_TPLS OFF BOOL "Whether ROCM TPLs should be enabled by default. Default: OFF") # Unlike CUDA, ROCm does not automatically install these TPLs SET(ROCBLAS_DEFAULT OFF) SET(ROCSPARSE_DEFAULT OFF) +SET(ROCSOLVER_DEFAULT OFF) # Since the default is OFF we do not really need this piece of logic here. 
# IF(KOKKOSKERNELS_NO_DEFAULT_ROCM_TPLS) # SET(ROCBLAS_DEFAULT OFF) # SET(ROCSPARSE_DEFAULT OFF) # ENDIF() KOKKOSKERNELS_ADD_TPL_OPTION(ROCBLAS ${ROCBLAS_DEFAULT} "Whether to enable ROCBLAS" - DEFAULT_DOCSTRING "ON if HIP-enabled Kokkos, otherwise OFF") + DEFAULT_DOCSTRING "OFF even if HIP-enabled Kokkos") KOKKOSKERNELS_ADD_TPL_OPTION(ROCSPARSE ${ROCSPARSE_DEFAULT} "Whether to enable ROCSPARSE" - DEFAULT_DOCSTRING "ON if HIP-enabled Kokkos, otherwise OFF") + DEFAULT_DOCSTRING "OFF even if HIP-enabled Kokkos") +KOKKOSKERNELS_ADD_TPL_OPTION(ROCSOLVER ${ROCSOLVER_DEFAULT} "Whether to enable ROCSOLVER" + DEFAULT_DOCSTRING "OFF even if HIP-enabled Kokkos") IF (KOKKOSKERNELS_ENABLE_TPL_MAGMA) IF (F77_BLAS_MANGLE STREQUAL "(name,NAME) name ## _") @@ -498,6 +505,7 @@ IF (NOT KOKKOSKERNELS_HAS_TRILINOS) KOKKOSKERNELS_IMPORT_TPL(MKL) KOKKOSKERNELS_IMPORT_TPL(CUBLAS) KOKKOSKERNELS_IMPORT_TPL(CUSPARSE) + KOKKOSKERNELS_IMPORT_TPL(CUSOLVER) KOKKOSKERNELS_IMPORT_TPL(CBLAS) KOKKOSKERNELS_IMPORT_TPL(LAPACKE) KOKKOSKERNELS_IMPORT_TPL(CHOLMOD) @@ -507,6 +515,7 @@ IF (NOT KOKKOSKERNELS_HAS_TRILINOS) KOKKOSKERNELS_IMPORT_TPL(MAGMA) KOKKOSKERNELS_IMPORT_TPL(ROCBLAS) KOKKOSKERNELS_IMPORT_TPL(ROCSPARSE) + KOKKOSKERNELS_IMPORT_TPL(ROCSOLVER) ELSE () IF (Trilinos_ENABLE_SuperLU5_API) SET(HAVE_KOKKOSKERNELS_SUPERLU5_API TRUE) diff --git a/common/impl/KokkosKernels_ViewUtils.hpp b/common/impl/KokkosKernels_ViewUtils.hpp index ac4abb6457..2ae8fb609d 100644 --- a/common/impl/KokkosKernels_ViewUtils.hpp +++ b/common/impl/KokkosKernels_ViewUtils.hpp @@ -19,11 +19,6 @@ #include "Kokkos_Core.hpp" namespace KokkosKernels::Impl { -// lbv - 07/26/2023: -// MemoryTraits::impl_value was added -// in Kokkos 4.1.00 so we should guard -// the content of this header until v4.3.0 -#if KOKKOS_VERSION >= 40100 || defined(DOXY) /*! 
\brief Yields a type that is View with Kokkos::Unmanaged added to the memory * traits @@ -59,7 +54,6 @@ auto make_unmanaged(const View &v) { return typename with_unmanaged::type(v); } -#endif // KOKKOS_VERSION >= 40100 } // namespace KokkosKernels::Impl #endif diff --git a/common/src/KokkosKernels_ExecSpaceUtils.hpp b/common/src/KokkosKernels_ExecSpaceUtils.hpp index 2ec09f4069..4d3a3002b4 100644 --- a/common/src/KokkosKernels_ExecSpaceUtils.hpp +++ b/common/src/KokkosKernels_ExecSpaceUtils.hpp @@ -215,10 +215,21 @@ inline void kk_get_free_total_memory(size_t& free_mem, total_mem /= n_streams; } template <> +inline void kk_get_free_total_memory(size_t& free_mem, + size_t& total_mem, + int n_streams) { + kk_get_free_total_memory(free_mem, total_mem, n_streams); +} +template <> inline void kk_get_free_total_memory(size_t& free_mem, size_t& total_mem) { kk_get_free_total_memory(free_mem, total_mem, 1); } +template <> +inline void kk_get_free_total_memory( + size_t& free_mem, size_t& total_mem) { + kk_get_free_total_memory(free_mem, total_mem, 1); +} #endif // FIXME_SYCL Use compiler extension instead of low level interface when diff --git a/common/src/KokkosKernels_PrintConfiguration.hpp b/common/src/KokkosKernels_PrintConfiguration.hpp index cd2333b3ec..c2e3a5187f 100644 --- a/common/src/KokkosKernels_PrintConfiguration.hpp +++ b/common/src/KokkosKernels_PrintConfiguration.hpp @@ -44,6 +44,18 @@ inline void print_cusparse_version_if_enabled(std::ostream& os) { << "KOKKOSKERNELS_ENABLE_TPL_CUSPARSE: no\n"; #endif } + +inline void print_cusolver_version_if_enabled(std::ostream& os) { +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSOLVER + os << " " + << "KOKKOSKERNELS_ENABLE_TPL_CUSOLVER: " << cusolver_version_string() + << "\n"; +#else + os << " " + << "KOKKOSKERNELS_ENABLE_TPL_CUSOLVER: no\n"; +#endif +} + inline void print_enabled_tpls(std::ostream& os) { #ifdef KOKKOSKERNELS_ENABLE_TPL_LAPACK os << " " @@ -96,6 +108,7 @@ inline void print_enabled_tpls(std::ostream& os) { 
#endif print_cublas_version_if_enabled(os); print_cusparse_version_if_enabled(os); + print_cusolver_version_if_enabled(os); #ifdef KOKKOSKERNELS_ENABLE_TPL_ROCBLAS os << " " << "KOKKOSKERNELS_ENABLE_TPL_ROCBLAS: yes\n"; @@ -110,6 +123,13 @@ inline void print_enabled_tpls(std::ostream& os) { os << " " << "KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE: no\n"; #endif +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER + os << " " + << "KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER: yes\n"; +#else + os << " " + << "KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER: no\n"; +#endif #ifdef KOKKOSKERNELS_ENABLE_TPL_METIS os << "KOKKOSKERNELS_ENABLE_TPL_METIS: yes\n"; #else diff --git a/common/src/KokkosKernels_TplsVersion.hpp b/common/src/KokkosKernels_TplsVersion.hpp index 38de7c1399..3e00d72457 100644 --- a/common/src/KokkosKernels_TplsVersion.hpp +++ b/common/src/KokkosKernels_TplsVersion.hpp @@ -28,6 +28,10 @@ #include "cusparse.h" #endif +#if defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER) +#include "cusolver_common.h" +#endif + namespace KokkosKernels { #if defined(KOKKOSKERNELS_ENABLE_TPL_CUBLAS) @@ -53,5 +57,16 @@ inline std::string cusparse_version_string() { } #endif +#if defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER) +inline std::string cusolver_version_string() { + std::stringstream ss; + + ss << CUSOLVER_VER_MAJOR << "." << CUSOLVER_VER_MINOR << "." + << CUSOLVER_VER_PATCH << "." << CUSOLVER_VER_BUILD; + + return ss.str(); +} +#endif + } // namespace KokkosKernels #endif // _KOKKOSKERNELS_TPLS_VERSIONS_HPP diff --git a/common/src/KokkosKernels_Utils.hpp b/common/src/KokkosKernels_Utils.hpp index e1c15505ff..ba8049cecf 100644 --- a/common/src/KokkosKernels_Utils.hpp +++ b/common/src/KokkosKernels_Utils.hpp @@ -890,7 +890,7 @@ void permute_block_vector(typename idx_array_type::value_type num_elements, // TODO BMK: clean this up by removing 1st argument. It is unused but // its name gives the impression that only num_elements of the vector are // zeroed, when really it's always the whole thing. 
-template +template void zero_vector(ExecSpaceIn &exec_space_in, typename value_array_type::value_type /* num_elements */, value_array_type &vector) { @@ -906,8 +906,7 @@ void zero_vector(typename value_array_type::value_type /* num_elements */, using ne_tmp_t = typename value_array_type::value_type; ne_tmp_t ne_tmp = ne_tmp_t(0); MyExecSpace my_exec_space; - zero_vector(my_exec_space, ne_tmp, - vector); + zero_vector(my_exec_space, ne_tmp, vector); } template diff --git a/common/src/KokkosKernels_helpers.hpp b/common/src/KokkosKernels_helpers.hpp index b36360b991..1b725f2f5c 100644 --- a/common/src/KokkosKernels_helpers.hpp +++ b/common/src/KokkosKernels_helpers.hpp @@ -29,11 +29,11 @@ namespace Impl { // Used to reduce number of code instantiations. template struct GetUnifiedLayoutPreferring { - typedef typename std::conditional< - ((ViewType::rank == 1) && (!std::is_same::value)) || - ((ViewType::rank == 0)), - PreferredLayoutType, typename ViewType::array_layout>::type array_layout; + using array_layout = typename std::conditional< + ((ViewType::rank == 1) && !std::is_same_v) || + (ViewType::rank == 0), + PreferredLayoutType, typename ViewType::array_layout>::type; }; template diff --git a/common/src/KokkosLinAlg_config.h b/common/src/KokkosLinAlg_config.h index fccfe799ca..fe97c1de8b 100644 --- a/common/src/KokkosLinAlg_config.h +++ b/common/src/KokkosLinAlg_config.h @@ -18,6 +18,8 @@ #ifndef KOKKOSLINALG_CONFIG_H #define KOKKOSLINALG_CONFIG_H +[[deprecated("KokkosLinAlg_config.h is deprecated!")]] + #include #endif // KOKKOSLINALG_CONFIG_H diff --git a/docs/developer/apidocs/sparse.rst b/docs/developer/apidocs/sparse.rst index 415f72eec8..3a55e50c8b 100644 --- a/docs/developer/apidocs/sparse.rst +++ b/docs/developer/apidocs/sparse.rst @@ -94,3 +94,16 @@ par_ilut gmres ----- .. doxygenfunction:: gmres(KernelHandle* handle, AMatrix& A, BType& B, XType& X, Preconditioner* precond) + +sptrsv +------ +.. 
doxygenfunction:: sptrsv_symbolic(const ExecutionSpace &space, KernelHandle *handle, lno_row_view_t_ rowmap, lno_nnz_view_t_ entries) +.. doxygenfunction:: sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, lno_nnz_view_t_ entries) +.. doxygenfunction:: sptrsv_symbolic(ExecutionSpace &space, KernelHandle *handle, lno_row_view_t_ rowmap, lno_nnz_view_t_ entries, scalar_nnz_view_t_ values) +.. doxygenfunction:: sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, lno_nnz_view_t_ entries, scalar_nnz_view_t_ values) +.. doxygenfunction:: sptrsv_solve(ExecutionSpace &space, KernelHandle *handle, lno_row_view_t_ rowmap, lno_nnz_view_t_ entries, scalar_nnz_view_t_ values, BType b, XType x) +.. doxygenfunction:: sptrsv_solve(KernelHandle *handle, lno_row_view_t_ rowmap, lno_nnz_view_t_ entries, scalar_nnz_view_t_ values, BType b, XType x) +.. doxygenfunction:: sptrsv_solve(ExecutionSpace &space, KernelHandle *handle, XType x, XType b) +.. doxygenfunction:: sptrsv_solve(KernelHandle *handle, XType x, XType b) +.. doxygenfunction:: sptrsv_solve(ExecutionSpace &space, KernelHandle *handleL, KernelHandle *handleU, XType x, XType b) +.. 
doxygenfunction:: sptrsv_solve(KernelHandle *handleL, KernelHandle *handleU, XType x, XType b) diff --git a/docs/requirements.txt b/docs/requirements.txt index 188f51e62d..75f092707b 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1 +1,2 @@ -breathe \ No newline at end of file +breathe +sphinx-rtd-theme \ No newline at end of file diff --git a/example/wiki/CMakeLists.txt b/example/wiki/CMakeLists.txt index 11c6e0d97d..1e751f5797 100644 --- a/example/wiki/CMakeLists.txt +++ b/example/wiki/CMakeLists.txt @@ -1,2 +1,3 @@ +ADD_SUBDIRECTORY(blas) ADD_SUBDIRECTORY(sparse) ADD_SUBDIRECTORY(graph) diff --git a/example/wiki/blas/CMakeLists.txt b/example/wiki/blas/CMakeLists.txt new file mode 100644 index 0000000000..245957bc89 --- /dev/null +++ b/example/wiki/blas/CMakeLists.txt @@ -0,0 +1,19 @@ +KOKKOSKERNELS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR}) +KOKKOSKERNELS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}) + +KOKKOSKERNELS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/../../../../test_common) + +KOKKOSKERNELS_ADD_EXECUTABLE_AND_TEST( + wiki_blas2_ger + SOURCES KokkosBlas2_wiki_ger.cpp + ) + +KOKKOSKERNELS_ADD_EXECUTABLE_AND_TEST( + wiki_blas2_syr + SOURCES KokkosBlas2_wiki_syr.cpp + ) + +KOKKOSKERNELS_ADD_EXECUTABLE_AND_TEST( + wiki_blas2_syr2 + SOURCES KokkosBlas2_wiki_syr2.cpp + ) diff --git a/example/wiki/blas/KokkosBlas2_wiki_ger.cpp b/example/wiki/blas/KokkosBlas2_wiki_ger.cpp new file mode 100644 index 0000000000..89eaaf9292 --- /dev/null +++ b/example/wiki/blas/KokkosBlas2_wiki_ger.cpp @@ -0,0 +1,23 @@ +#include +#include + +int main(int argc, char* argv[]) { + Kokkos::initialize(argc, argv); + { + constexpr int M = 5; + constexpr int N = 4; + + Kokkos::View A("A", M, N); + Kokkos::View x("X", M); + Kokkos::View y("Y", N); + + Kokkos::deep_copy(A, 1.0); + Kokkos::deep_copy(x, 3.0); + Kokkos::deep_copy(y, 1.3); + + const double alpha = Kokkos::ArithTraits::one(); + + KokkosBlas::ger("T", alpha, x, y, A); + } + Kokkos::finalize(); +} 
diff --git a/example/wiki/blas/KokkosBlas2_wiki_syr.cpp b/example/wiki/blas/KokkosBlas2_wiki_syr.cpp new file mode 100644 index 0000000000..26c6a489b8 --- /dev/null +++ b/example/wiki/blas/KokkosBlas2_wiki_syr.cpp @@ -0,0 +1,20 @@ +#include +#include + +int main(int argc, char* argv[]) { + Kokkos::initialize(argc, argv); + { + constexpr int M = 5; + + Kokkos::View A("A", M, M); + Kokkos::View x("X", M); + + Kokkos::deep_copy(A, 1.0); + Kokkos::deep_copy(x, 3.0); + + const double alpha = double(1.0); + + KokkosBlas::syr("T", "U", alpha, x, A); + } + Kokkos::finalize(); +} diff --git a/example/wiki/blas/KokkosBlas2_wiki_syr2.cpp b/example/wiki/blas/KokkosBlas2_wiki_syr2.cpp new file mode 100644 index 0000000000..c1c8e5d0d1 --- /dev/null +++ b/example/wiki/blas/KokkosBlas2_wiki_syr2.cpp @@ -0,0 +1,22 @@ +#include +#include + +int main(int argc, char* argv[]) { + Kokkos::initialize(argc, argv); + { + constexpr int M = 5; + + Kokkos::View A("A", M, M); + Kokkos::View x("X", M); + Kokkos::View y("Y", M); + + Kokkos::deep_copy(A, 1.0); + Kokkos::deep_copy(x, 3.0); + Kokkos::deep_copy(y, 1.3); + + const double alpha = double(1.0); + + KokkosBlas::syr2("T", "U", alpha, x, y, A); + } + Kokkos::finalize(); +} diff --git a/example/wiki/sparse/KokkosSparse_wiki_bsrmatrix.cpp b/example/wiki/sparse/KokkosSparse_wiki_bsrmatrix.cpp index 6a67c1aec4..eacf134f89 100644 --- a/example/wiki/sparse/KokkosSparse_wiki_bsrmatrix.cpp +++ b/example/wiki/sparse/KokkosSparse_wiki_bsrmatrix.cpp @@ -15,6 +15,7 @@ //@HEADER #include +#include #include "Kokkos_Core.hpp" diff --git a/graph/unit_test/CMakeLists.txt b/graph/unit_test/CMakeLists.txt index 63539d9776..b497953159 100644 --- a/graph/unit_test/CMakeLists.txt +++ b/graph/unit_test/CMakeLists.txt @@ -10,6 +10,12 @@ KOKKOSKERNELS_INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_C # # ##################### +IF (KokkosKernels_TEST_ETI_ONLY) + IF (NOT KokkosKernels_INST_DOUBLE AND NOT KokkosKernels_INST_FLOAT) + 
MESSAGE(FATAL_ERROR "Because only ETI'd type combinations are enabled for testing, the Kokkos Kernels graph tests require that double or float is enabled in ETI.") + ENDIF () +ENDIF () + ##################### # # # Add GPU backends # diff --git a/graph/unit_test/Test_Graph_graph_color.hpp b/graph/unit_test/Test_Graph_graph_color.hpp index 5d4eec03ca..101c489bc0 100644 --- a/graph/unit_test/Test_Graph_graph_color.hpp +++ b/graph/unit_test/Test_Graph_graph_color.hpp @@ -110,10 +110,15 @@ void test_coloring(lno_t numRows, size_type nnz, lno_t bandwidth, COLORING_DEFAULT, COLORING_SERIAL, COLORING_VB, COLORING_VBBIT, COLORING_VBCS}; -#ifdef KOKKOS_ENABLE_CUDA + // FIXME: VBD sometimes fails on CUDA and HIP +#if defined(KOKKOS_ENABLE_CUDA) if (!std::is_same::value) { coloring_algorithms.push_back(COLORING_VBD); } +#elif defined(KOKKOS_ENABLE_HIP) + if (!std::is_same::value) { + coloring_algorithms.push_back(COLORING_VBD); + } #else coloring_algorithms.push_back(COLORING_VBD); #endif @@ -174,9 +179,15 @@ void test_coloring(lno_t numRows, size_type nnz, lno_t bandwidth, } } } - EXPECT_TRUE((num_conflict == conf)); - - EXPECT_TRUE((num_conflict == 0)); + EXPECT_TRUE((num_conflict == conf)) + << "Coloring algo " << (int)coloring_algorithm + << ": kk_is_d1_coloring_valid returned incorrect number of conflicts (" + << num_conflict << ", should be " << conf << ")"; + + EXPECT_TRUE((num_conflict == 0)) + << "Coloring algo " << (int)coloring_algorithm + << ": D1 coloring produced invalid coloring (" << num_conflict + << " conflicts)"; } // device::execution_space::finalize(); } diff --git a/lapack/CMakeLists.txt b/lapack/CMakeLists.txt index 7c0c3183bd..f825a2184a 100644 --- a/lapack/CMakeLists.txt +++ b/lapack/CMakeLists.txt @@ -28,19 +28,12 @@ IF (KOKKOSKERNELS_ENABLE_TPL_LAPACK OR KOKKOSKERNELS_ENABLE_TPL_MKL OR KOKKOSKER ENDIF() # Include cuda lapack TPL source file -IF (KOKKOSKERNELS_ENABLE_TPL_MAGMA) +IF (KOKKOSKERNELS_ENABLE_TPL_CUSOLVER) LIST(APPEND SOURCES 
lapack/tpls/KokkosLapack_Cuda_tpl.cpp ) ENDIF() -# Include rocm lapack TPL source file -IF (KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER) - LIST(APPEND SOURCES - lapack/tpls/KokkosLapack_Rocm_tpl.cpp - ) -ENDIF() - ################## # # # ETI generation # @@ -65,3 +58,10 @@ KOKKOSKERNELS_GENERATE_ETI(Lapack_trtri trtri SOURCE_LIST SOURCES TYPE_LISTS FLOATS LAYOUTS DEVICES ) + +KOKKOSKERNELS_GENERATE_ETI(Lapack_svd svd + COMPONENTS lapack + HEADER_LIST ETI_HEADERS + SOURCE_LIST SOURCES + TYPE_LISTS FLOATS LAYOUTS DEVICES +) diff --git a/lapack/eti/generated_specializations_cpp/svd/KokkosLapack_svd_eti_spec_inst.cpp.in b/lapack/eti/generated_specializations_cpp/svd/KokkosLapack_svd_eti_spec_inst.cpp.in new file mode 100644 index 0000000000..62dd75475f --- /dev/null +++ b/lapack/eti/generated_specializations_cpp/svd/KokkosLapack_svd_eti_spec_inst.cpp.in @@ -0,0 +1,26 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + + +#define KOKKOSKERNELS_IMPL_COMPILE_LIBRARY true +#include "KokkosKernels_config.h" +#include "KokkosLapack_svd_spec.hpp" + +namespace KokkosLapack { +namespace Impl { +@LAPACK_SVD_ETI_INST_BLOCK@ + } //IMPL +} //Kokkos diff --git a/lapack/eti/generated_specializations_hpp/KokkosLapack_svd_eti_spec_avail.hpp.in b/lapack/eti/generated_specializations_hpp/KokkosLapack_svd_eti_spec_avail.hpp.in new file mode 100644 index 0000000000..49e526b7e8 --- /dev/null +++ b/lapack/eti/generated_specializations_hpp/KokkosLapack_svd_eti_spec_avail.hpp.in @@ -0,0 +1,24 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSLAPACK_SVD_ETI_SPEC_AVAIL_HPP_ +#define KOKKOSLAPACK_SVD_ETI_SPEC_AVAIL_HPP_ +namespace KokkosLapack { +namespace Impl { +@LAPACK_SVD_ETI_AVAIL_BLOCK@ + } //IMPL +} //Kokkos +#endif diff --git a/lapack/impl/KokkosLapack_gesv_spec.hpp b/lapack/impl/KokkosLapack_gesv_spec.hpp index b9f8549311..97d74280ff 100644 --- a/lapack/impl/KokkosLapack_gesv_spec.hpp +++ b/lapack/impl/KokkosLapack_gesv_spec.hpp @@ -28,7 +28,7 @@ namespace KokkosLapack { namespace Impl { // Specialization struct which defines whether a specialization exists -template +template struct gesv_eti_spec_avail { enum : bool { value = false }; }; @@ -46,12 +46,16 @@ struct gesv_eti_spec_avail { EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ template <> \ struct gesv_eti_spec_avail< \ + EXEC_SPACE_TYPE, \ Kokkos::View, \ - Kokkos::MemoryTraits >, \ + Kokkos::MemoryTraits>, \ Kokkos::View, \ - Kokkos::MemoryTraits > > { \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>> { \ enum : bool { value = true }; \ }; @@ -65,24 +69,28 @@ namespace Impl { // Unification layer /// \brief Implementation of KokkosLapack::gesv. -template ::value, - bool eti_spec_avail = gesv_eti_spec_avail::value> +template ::value, + bool eti_spec_avail = + gesv_eti_spec_avail::value> struct GESV { - static void gesv(const AMatrix &A, const BXMV &B, const IPIVV &IPIV); + static void gesv(const ExecutionSpace &space, const AMatrix &A, const BXMV &B, + const IPIVV &IPIV); }; #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY //! Full specialization of gesv for multi vectors. 
// Unification layer -template -struct GESV { - static void gesv(const AMatrix & /* A */, const BXMV & /* B */, - const IPIVV & /* IPIV */) { +template +struct GESV { + static void gesv(const ExecutionSpace & /* space */, const AMatrix & /* A */, + const BXMV & /* B */, const IPIVV & /* IPIV */) { // NOTE: Might add the implementation of KokkosLapack::gesv later throw std::runtime_error( "No fallback implementation of GESV (general LU factorization & solve) " - "exists. Enable LAPACK and/or MAGMA TPL."); + "exists. Enable LAPACK, CUSOLVER, ROCSOLVER or MAGMA TPL."); } }; @@ -100,31 +108,33 @@ struct GESV { #define KOKKOSLAPACK_GESV_ETI_SPEC_DECL(SCALAR_TYPE, LAYOUT_TYPE, \ EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ extern template struct GESV< \ + EXEC_SPACE_TYPE, \ Kokkos::View, \ - Kokkos::MemoryTraits >, \ + Kokkos::MemoryTraits>, \ Kokkos::View, \ - Kokkos::MemoryTraits >, \ + Kokkos::MemoryTraits>, \ Kokkos::View, \ - Kokkos::MemoryTraits >, \ + Kokkos::MemoryTraits>, \ false, true>; #define KOKKOSLAPACK_GESV_ETI_SPEC_INST(SCALAR_TYPE, LAYOUT_TYPE, \ EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ template struct GESV< \ + EXEC_SPACE_TYPE, \ Kokkos::View, \ - Kokkos::MemoryTraits >, \ + Kokkos::MemoryTraits>, \ Kokkos::View, \ - Kokkos::MemoryTraits >, \ + Kokkos::MemoryTraits>, \ Kokkos::View, \ - Kokkos::MemoryTraits >, \ + Kokkos::MemoryTraits>, \ false, true>; #include diff --git a/lapack/impl/KokkosLapack_svd_impl.hpp b/lapack/impl/KokkosLapack_svd_impl.hpp new file mode 100644 index 0000000000..49df758936 --- /dev/null +++ b/lapack/impl/KokkosLapack_svd_impl.hpp @@ -0,0 +1,34 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. 
+// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSLAPACK_IMPL_SVD_HPP_ +#define KOKKOSLAPACK_IMPL_SVD_HPP_ + +/// \file KokkosLapack_svd_impl.hpp +/// \brief Implementation(s) of singular value decomposition of a dense matrix. + +#include +#include + +namespace KokkosLapack { +namespace Impl { + +// NOTE: Might add the implementation of KokkosLapack::svd later + +} // namespace Impl +} // namespace KokkosLapack + +#endif // KOKKOSLAPACK_IMPL_SVD_HPP diff --git a/lapack/impl/KokkosLapack_svd_spec.hpp b/lapack/impl/KokkosLapack_svd_spec.hpp new file mode 100644 index 0000000000..fc0a34f790 --- /dev/null +++ b/lapack/impl/KokkosLapack_svd_spec.hpp @@ -0,0 +1,156 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER +#ifndef KOKKOSLAPACK_IMPL_SVD_SPEC_HPP_ +#define KOKKOSLAPACK_IMPL_SVD_SPEC_HPP_ + +#include +#include +#include + +// Include the actual functors +#if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY +#include +#endif + +namespace KokkosLapack { +namespace Impl { +// Specialization struct which defines whether a specialization exists +template +struct svd_eti_spec_avail { + enum : bool { value = false }; +}; +} // namespace Impl +} // namespace KokkosLapack + +// +// Macro for declaration of full specialization availability +// KokkosLapack::Impl::SVD. This is NOT for users!!! 
All +// the declarations of full specializations go in this header file. +// We may spread out definitions (see _INST macro below) across one or +// more .cpp files. +// +#define KOKKOSLAPACK_SVD_ETI_SPEC_AVAIL(SCALAR_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ + template <> \ + struct svd_eti_spec_avail< \ + EXEC_SPACE_TYPE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type *, LAYOUT_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>> { \ + enum : bool { value = true }; \ + }; + +// Include the actual specialization declarations +#include +#include + +namespace KokkosLapack { +namespace Impl { + +// Unification layer +/// \brief Implementation of KokkosLapack::svd. + +template ::value, + bool eti_spec_avail = svd_eti_spec_avail< + ExecutionSpace, AMatrix, SVector, UMatrix, VMatrix>::value> +struct SVD { + static void svd(const ExecutionSpace &space, const char jobu[], + const char jobvt[], const AMatrix &A, const SVector &S, + const UMatrix &U, const VMatrix &Vt); +}; + +#if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY +//! Full specialization of svd +// Unification layer +template +struct SVD { + static void svd(const ExecutionSpace & /* space */, const char * /* jobu */, + const char * /* jobvt */, const AMatrix & /* A */, + const SVector & /* S */, const UMatrix & /* U */, + const VMatrix & /* Vt */) { + // NOTE: Might add the implementation of KokkosLapack::svd later + throw std::runtime_error( + "No fallback implementation of SVD (singular value decomposition) " + "exists. Enable LAPACK, CUSOLVER or ROCSOLVER TPL to use this " + "function."); + } +}; + +#endif +} // namespace Impl +} // namespace KokkosLapack + +// +// Macro for declaration of full specialization of +// KokkosLapack::Impl::SVD. This is NOT for users!!! All +// the declarations of full specializations go in this header file. 
+// We may spread out definitions (see _INST macro below) across one or +// more .cpp files. +// +#define KOKKOSLAPACK_SVD_ETI_SPEC_DECL(SCALAR_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ + extern template struct SVD< \ + EXEC_SPACE_TYPE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type *, LAYOUT_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + false, true>; + +#define KOKKOSLAPACK_SVD_ETI_SPEC_INST(SCALAR_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ + template struct SVD< \ + EXEC_SPACE_TYPE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type *, LAYOUT_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + false, true>; + +#include + +#endif // KOKKOSLAPACK_IMPL_SVD_SPEC_HPP_ diff --git a/lapack/src/KokkosLapack_gesv.hpp b/lapack/src/KokkosLapack_gesv.hpp index 4c9058f8ab..b66583bbdf 100644 --- a/lapack/src/KokkosLapack_gesv.hpp +++ b/lapack/src/KokkosLapack_gesv.hpp @@ -34,28 +34,50 @@ namespace KokkosLapack { /// \brief Solve the dense linear equation system A*X = B. /// +/// \tparam ExecutionSpace the space where the kernel will run. /// \tparam AMatrix Input matrix/Output LU, as a 2-D Kokkos::View. /// \tparam BXMV Input (right-hand side)/Output (solution) (multi)vector, as a -/// 1-D or 2-D Kokkos::View. \tparam IPIVV Output pivot indices, as a 1-D -/// Kokkos::View +/// 1-D or 2-D Kokkos::View. +/// \tparam IPIVV Output pivot indices, as a 1-D Kokkos::View /// +/// \param space [in] execution space instance used to specify how to execute +/// the gesv kernels. /// \param A [in,out] On entry, the N-by-N matrix to be solved. On exit, the /// factors L and U from /// the factorization A = P*L*U; the unit diagonal elements of L are not /// stored.
/// \param B [in,out] On entry, the right hand side (multi)vector B. On exit, -/// the solution (multi)vector X. \param IPIV [out] On exit, the pivot indices -/// (for partial pivoting). If the View extents are zero and -/// its data pointer is NULL, pivoting is not used. +/// the solution (multi)vector X. +/// \param IPIV [out] On exit, the pivot indices (for partial pivoting). +/// If the View extents are zero and its data pointer is NULL, pivoting is not +/// used. /// -template -void gesv(const AMatrix& A, const BXMV& B, const IPIVV& IPIV) { - // NOTE: Currently, KokkosLapack::gesv only supports for MAGMA TPL and LAPACK - // TPL. - // MAGMA TPL should be enabled to call the MAGMA GPU interface for - // device views LAPACK TPL should be enabled to call the LAPACK - // interface for host views +template +void gesv(const ExecutionSpace& space, const AMatrix& A, const BXMV& B, + const IPIVV& IPIV) { + // NOTE: Currently, KokkosLapack::gesv only supports LAPACK, MAGMA, + // cuSOLVER and rocSOLVER TPLs.
+ // MAGMA/rocSOLVER TPL should be enabled to call the MAGMA/rocSOLVER GPU + // interface for device views LAPACK TPL should be enabled to call the + // LAPACK interface for host views + static_assert( + Kokkos::SpaceAccessibility::accessible); + static_assert( + Kokkos::SpaceAccessibility::accessible); +#if defined(KOKKOSKERNELS_ENABLE_TPL_MAGMA) + if constexpr (!std::is_same_v) { + static_assert( + Kokkos::SpaceAccessibility::accessible); + } +#else + static_assert( + Kokkos::SpaceAccessibility::accessible); +#endif static_assert(Kokkos::is_view::value, "KokkosLapack::gesv: A must be a Kokkos::View."); static_assert(Kokkos::is_view::value, @@ -137,15 +159,38 @@ void gesv(const AMatrix& A, const BXMV& B, const IPIVV& IPIV) { if (BXMV::rank == 1) { auto B_i = BXMV_Internal(B.data(), B.extent(0), 1); - KokkosLapack::Impl::GESV::gesv(A_i, B_i, IPIV_i); + KokkosLapack::Impl::GESV::gesv(space, A_i, B_i, IPIV_i); } else { // BXMV::rank == 2 auto B_i = BXMV_Internal(B.data(), B.extent(0), B.extent(1)); - KokkosLapack::Impl::GESV::gesv(A_i, B_i, IPIV_i); + KokkosLapack::Impl::GESV::gesv(space, A_i, B_i, IPIV_i); } } +/// \brief Solve the dense linear equation system A*X = B. +/// +/// \tparam AMatrix Input matrix/Output LU, as a 2-D Kokkos::View. +/// \tparam BXMV Input (right-hand side)/Output (solution) (multi)vector, as a +/// 1-D or 2-D Kokkos::View. +/// \tparam IPIVV Output pivot indices, as a 1-D Kokkos::View +/// +/// \param A [in,out] On entry, the N-by-N matrix to be solved. On exit, the +/// factors L and U from +/// the factorization A = P*L*U; the unit diagonal elements of L are not +/// stored. +/// \param B [in,out] On entry, the right hand side (multi)vector B. On exit, +/// the solution (multi)vector X. +/// \param IPIV [out] On exit, the pivot indices (for partial pivoting). +/// If the View extents are zero and its data pointer is NULL, pivoting is not +/// used. 
+/// +template +void gesv(const AMatrix& A, const BXMV& B, const IPIVV& IPIV) { + typename AMatrix::execution_space space{}; + gesv(space, A, B, IPIV); +} + } // namespace KokkosLapack #endif // KOKKOSLAPACK_GESV_HPP_ diff --git a/lapack/src/KokkosLapack_svd.hpp b/lapack/src/KokkosLapack_svd.hpp new file mode 100644 index 0000000000..71ea7cc30f --- /dev/null +++ b/lapack/src/KokkosLapack_svd.hpp @@ -0,0 +1,246 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +/// \file KokkosLapack_svd.hpp +/// \brief Singular Value Decomposition (SVD) +/// +/// This file provides KokkosLapack::svd. This function performs a +/// local (no MPI) singular value decomposition of the input matrix A +/// and returns the singular values and vectors depending on input flags. + +#ifndef KOKKOSLAPACK_SVD_HPP_ +#define KOKKOSLAPACK_SVD_HPP_ + +#include + +#include "KokkosLapack_svd_spec.hpp" +#include "KokkosKernels_Error.hpp" + +namespace KokkosLapack { + +// clang-format off +/// \brief Compute the Singular Value Decomposition of A = U*S*Vt +/// +/// \tparam ExecutionSpace the space where the kernel will run. +/// \tparam AMatrix (mxn) matrix as a rank-2 Kokkos::View. +/// \tparam SVector min(m,n) vector as a rank-1 Kokkos::View +/// \tparam UMatrix (mxm) matrix as a rank-2 Kokkos::View +/// \tparam VMatrix (nxn) matrix as a rank-2 Kokkos::View +/// +/// \param space [in] execution space instance used to specify how to execute +/// the svd kernels.
+/// \param jobu [in] flag to control the computation of the left singular +/// vectors when set to: 'A' all vectors are computed, 'S' the first min(m,n) +/// singular vectors are computed, 'O' the first min(m,n) singular vectors are +/// overwritten into A, 'N' no singular vectors are computed. +/// \param jobvt [in] flag to control the computation of the right singular +/// vectors when set to: 'A' all vectors are computed, 'S' the first min(m,n) +/// singular vectors are computed, 'O' the first min(m,n) singular vectors are +/// overwritten into A, 'N' no singular vectors are computed. +/// \param A [in] An m-by-n matrix to be decomposed using its singular values. +/// \param S [out] Vector of the min(m, n) singular values of A. +/// \param U [out] the first min(m, n) columns of U are the left singular +/// vectors of A. +/// \param Vt [out] the first min(m, n) columns of Vt are the right singular +/// vectors of A. +/// +// clang-format on +template +void svd(const ExecutionSpace& space, const char jobu[], const char jobvt[], + const AMatrix& A, const SVector& S, const UMatrix& U, + const VMatrix& Vt) { + static_assert( + Kokkos::SpaceAccessibility::accessible); + static_assert( + Kokkos::SpaceAccessibility::accessible); + static_assert( + Kokkos::SpaceAccessibility::accessible); + static_assert( + Kokkos::SpaceAccessibility::accessible); + static_assert(Kokkos::is_view::value, + "KokkosLapack::svd: A must be a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "KokkosLapack::svd: S must be a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "KokkosLapack::svd: U must be a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "KokkosLapack::svd: Vt must be a Kokkos::View."); + static_assert(AMatrix::rank() == 2, "KokkosLapack::svd: A must have rank 2."); + static_assert(SVector::rank() == 1, "KokkosLapack::svd: S must have rank 1."); + static_assert(UMatrix::rank() == 2, "KokkosLapack::svd: U must have rank 2."); + 
static_assert(VMatrix::rank() == 2, + "KokkosLapack::svd: Vt must have rank 2."); + + int64_t m = A.extent(0); + int64_t n = A.extent(1); + int64_t rankA = Kokkos::min(m, n); + + // No work to do since the matrix is empty... + // Also do not send a matrix with size zero + // to Lapack TPLs or they will complain! + if ((m == 0) || (n == 0)) { + return; + } + + // Check the jobu and jobvt control flags + // The only valid options there are 'A', 'S', 'O' and 'N' + const bool is_jobu_invalid = + !((jobu[0] == 'A') || (jobu[0] == 'a') || (jobu[0] == 'S') || + (jobu[0] == 's') || (jobu[0] == 'O') || (jobu[0] == 'o') || + (jobu[0] == 'N') || (jobu[0] == 'n')); + + const bool is_jobvt_invalid = + !((jobvt[0] == 'A') || (jobvt[0] == 'a') || (jobvt[0] == 'S') || + (jobvt[0] == 's') || (jobvt[0] == 'O') || (jobvt[0] == 'o') || + (jobvt[0] == 'N') || (jobvt[0] == 'n')); + + if (is_jobu_invalid && is_jobvt_invalid) { + std::ostringstream oss; + oss << "KokkosLapack::svd: both jobu and jobvt are invalid!\n" + << "Possible values are A, S, O or N, submitted values are " << jobu[0] + << " and " << jobvt[0] << "\n"; + KokkosKernels::Impl::throw_runtime_exception(oss.str()); + } + if (is_jobu_invalid) { + std::ostringstream oss; + oss << "KokkosLapack::svd: jobu is invalid!\n" + << "Possible values are A, S, O or N, submitted value is " << jobu[0] + << "\n"; + KokkosKernels::Impl::throw_runtime_exception(oss.str()); + } + if (is_jobvt_invalid) { + std::ostringstream oss; + oss << "KokkosLapack::svd: jobvt is invalid!\n" + << "Possible values are A, S, O or N, submitted value is " << jobvt[0] + << "\n"; + KokkosKernels::Impl::throw_runtime_exception(oss.str()); + } + + if (((jobu[0] == 'O') || (jobu[0] == 'o')) && + ((jobvt[0] == 'O') || (jobvt[0] == 'o'))) { + std::ostringstream oss; + oss << "KokkosLapack::svd: jobu and jobvt cannot be O at the same time!\n"; + KokkosKernels::Impl::throw_runtime_exception(oss.str()); + } + + // Check validity of output views sizes + // Note that of 
jobu/jobvt are set to O or N + // then the associated matrix does not need storage + bool is_extent_invalid = false; + std::ostringstream os; + if (S.extent_int(0) != rankA) { + is_extent_invalid = true; + os << "KokkosLapack::svd: S has extent " << S.extent(0) << ", instead of " + << rankA << ".\n"; + } + if ((jobu[0] == 'A') || (jobu[0] == 'a') || (jobu[0] == 'S') || + (jobu[0] == 's')) { + if (U.extent_int(0) != m || U.extent_int(1) != m) { + is_extent_invalid = true; + os << "KokkosLapack::svd: U has extents (" << U.extent(0) << ", " + << U.extent(1) << ") instead of (" << m << ", " << m << ").\n"; + } + } + if ((jobvt[0] == 'A') || (jobvt[0] == 'a') || (jobvt[0] == 'S') || + (jobvt[0] == 's')) { + if (Vt.extent_int(0) != n || Vt.extent_int(1) != n) { + is_extent_invalid = true; + os << "KokkosLapack::svd: V has extents (" << Vt.extent(0) << ", " + << Vt.extent(1) << ") instead of (" << n << ", " << n << ").\n"; + } + } + if (is_extent_invalid) { + KokkosKernels::Impl::throw_runtime_exception(os.str()); + } + +#if defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER) + if (std::is_same_v && + (A.extent(0) < A.extent(1))) { + throw std::runtime_error( + "CUSOLVER does not support SVD for matrices with more columns " + "than rows, you can transpose you matrix first then compute " + "SVD of that transpose: At=VSUt, and swap the output U and Vt" + " and transpose them to recover the desired SVD."); + } +#endif + + using AMatrix_Internal = Kokkos::View< + typename AMatrix::non_const_value_type**, typename AMatrix::array_layout, + typename AMatrix::device_type, Kokkos::MemoryTraits>; + + using SVector_Internal = Kokkos::View< + typename SVector::non_const_value_type*, typename SVector::array_layout, + typename SVector::device_type, Kokkos::MemoryTraits>; + + using UMatrix_Internal = Kokkos::View< + typename UMatrix::non_const_value_type**, typename UMatrix::array_layout, + typename UMatrix::device_type, Kokkos::MemoryTraits>; + + using VMatrix_Internal = Kokkos::View< + 
typename VMatrix::non_const_value_type**, typename VMatrix::array_layout, + typename VMatrix::device_type, Kokkos::MemoryTraits>; + + AMatrix_Internal A_i = A; + SVector_Internal S_i = S; + UMatrix_Internal U_i = U; + VMatrix_Internal Vt_i = Vt; + + KokkosLapack::Impl::SVD::svd(space, jobu, + jobvt, A_i, + S_i, U_i, + Vt_i); +} + +// clang-format off +/// \brief Compute the Singular Value Decomposition of A = U*S*Vt +/// +/// \tparam AMatrix (mxn) matrix as a rank-2 Kokkos::View. +/// \tparam SVector min(m,n) vector as a rank-1 Kokkos::View +/// \tparam UMatrix (mxm) matrix as a rank-2 Kokkos::View +/// \tparam VMatrix (nxn) matrix as a rank-2 Kokkos::View +/// +/// \param jobu [in] flag to control the computation of the left singular +/// vectors when set to: 'A' all vectors are computed, 'S' the first min(m,n) +/// singular vectors are computed, 'O' the first min(m,n) singular vectors are +/// overwritten into A, 'N' no singular vectors are computed. +/// \param jobvt [in] flag to control the computation of the right singular +/// vectors when set to: 'A' all vectors are computed, 'S' the first min(m,n) +/// singular vectors are computed, 'O' the first min(m,n) singular vectors are +/// overwritten into A, 'N' no singular vectors are computed. +/// \param A [in] An m-by-n matrix to be decomposed using its singular values. +/// \param S [out] Vector of the min(m, n) singular values of A. +/// \param U [out] the first min(m, n) columns of U are the left singular +/// vectors of A. +/// \param Vt [out] the first min(m, n) columns of Vt are the right singular +/// vectors of A. 
+/// +// clang-format on +template +void svd(const char jobu[], const char jobvt[], const AMatrix& A, + const SVector& S, const UMatrix& U, const VMatrix& Vt) { + typename AMatrix::execution_space space{}; + svd(space, jobu, jobvt, A, S, U, Vt); +} + +} // namespace KokkosLapack + +#endif // KOKKOSLAPACK_SVD_HPP_ diff --git a/lapack/tpls/KokkosLapack_Cuda_tpl.hpp b/lapack/tpls/KokkosLapack_Cuda_tpl.hpp index 2ce9f69954..6749a4740f 100644 --- a/lapack/tpls/KokkosLapack_Cuda_tpl.hpp +++ b/lapack/tpls/KokkosLapack_Cuda_tpl.hpp @@ -16,6 +16,29 @@ #ifndef KOKKOSLAPACK_CUDA_TPL_HPP_ #define KOKKOSLAPACK_CUDA_TPL_HPP_ +#if defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER) +#include "KokkosLapack_cusolver.hpp" + +namespace KokkosLapack { +namespace Impl { + +CudaLapackSingleton::CudaLapackSingleton() { + cusolverStatus_t stat = cusolverDnCreate(&handle); + if (stat != CUSOLVER_STATUS_SUCCESS) + Kokkos::abort("CUSOLVER initialization failed\n"); + + Kokkos::push_finalize_hook([&]() { cusolverDnDestroy(handle); }); +} + +CudaLapackSingleton& CudaLapackSingleton::singleton() { + static CudaLapackSingleton s; + return s; +} + +} // namespace Impl +} // namespace KokkosLapack +#endif // defined (KOKKOSKERNELS_ENABLE_TPL_CUSOLVER) + #if defined(KOKKOSKERNELS_ENABLE_TPL_MAGMA) #include diff --git a/lapack/tpls/KokkosLapack_Host_tpl.cpp b/lapack/tpls/KokkosLapack_Host_tpl.cpp index d629a17f1d..add0a802bd 100644 --- a/lapack/tpls/KokkosLapack_Host_tpl.cpp +++ b/lapack/tpls/KokkosLapack_Host_tpl.cpp @@ -38,6 +38,31 @@ void F77_BLAS_MANGLE(cgesv, CGESV)(int*, int*, std::complex*, int*, int*, void F77_BLAS_MANGLE(zgesv, ZGESV)(int*, int*, std::complex*, int*, int*, std::complex*, int*, int*); +/// +/// Gesvd +/// + +void F77_BLAS_MANGLE(sgesvd, SGESVD)(const char*, const char*, const int*, + const int*, float*, const int*, float*, + float*, const int*, float*, const int*, + float*, int*, int*); +void F77_BLAS_MANGLE(dgesvd, DGESVD)(const char*, const char*, const int*, + const int*, double*, 
const int*, double*, + double*, const int*, double*, const int*, + double*, int*, int*); +void F77_BLAS_MANGLE(cgesvd, CGESVD)(const char*, const char*, const int*, + const int*, std::complex*, + const int*, float*, std::complex*, + const int*, std::complex*, + const int*, std::complex*, int*, + float*, int*); +void F77_BLAS_MANGLE(zgesvd, ZGESVD)(const char*, const char*, const int*, + const int*, std::complex*, + const int*, double*, std::complex*, + const int*, std::complex*, + const int*, std::complex*, int*, + double*, int*); + /// /// Trtri /// @@ -64,6 +89,11 @@ void F77_BLAS_MANGLE(ztrtri, ZTRTRI)(const char*, const char*, int*, #define F77_FUNC_CGESV F77_BLAS_MANGLE(cgesv, CGESV) #define F77_FUNC_ZGESV F77_BLAS_MANGLE(zgesv, ZGESV) +#define F77_FUNC_SGESVD F77_BLAS_MANGLE(sgesvd, SGESVD) +#define F77_FUNC_DGESVD F77_BLAS_MANGLE(dgesvd, DGESVD) +#define F77_FUNC_CGESVD F77_BLAS_MANGLE(cgesvd, CGESVD) +#define F77_FUNC_ZGESVD F77_BLAS_MANGLE(zgesvd, ZGESVD) + #define F77_FUNC_STRTRI F77_BLAS_MANGLE(strtri, STRTRI) #define F77_FUNC_DTRTRI F77_BLAS_MANGLE(dtrtri, DTRTRI) #define F77_FUNC_CTRTRI F77_BLAS_MANGLE(ctrtri, CTRTRI) @@ -82,6 +112,15 @@ void HostLapack::gesv(int n, int rhs, float* a, int lda, int* ipiv, F77_FUNC_SGESV(&n, &rhs, a, &lda, ipiv, b, &ldb, &info); } template <> +void HostLapack::gesvd(const char jobu, const char jobvt, const int m, + const int n, float* a, const int lda, float* s, + float* u, const int ldu, float* vt, + const int ldvt, float* work, int lwork, + float* /*rwork*/, int info) { + F77_FUNC_SGESVD(&jobu, &jobvt, &m, &n, a, &lda, s, u, &ldu, vt, &ldvt, work, + &lwork, &info); +} +template <> int HostLapack::trtri(const char uplo, const char diag, int n, const float* a, int lda) { int info = 0; @@ -99,6 +138,15 @@ void HostLapack::gesv(int n, int rhs, double* a, int lda, int* ipiv, F77_FUNC_DGESV(&n, &rhs, a, &lda, ipiv, b, &ldb, &info); } template <> +void HostLapack::gesvd(const char jobu, const char jobvt, const int m, + const 
int n, double* a, const int lda, double* s, + double* u, const int ldu, double* vt, + const int ldvt, double* work, int lwork, + double* /*rwork*/, int info) { + F77_FUNC_DGESVD(&jobu, &jobvt, &m, &n, a, &lda, s, u, &ldu, vt, &ldvt, work, + &lwork, &info); +} +template <> int HostLapack::trtri(const char uplo, const char diag, int n, const double* a, int lda) { int info = 0; @@ -118,6 +166,15 @@ void HostLapack >::gesv(int n, int rhs, F77_FUNC_CGESV(&n, &rhs, a, &lda, ipiv, b, &ldb, &info); } template <> +void HostLapack >::gesvd( + const char jobu, const char jobvt, const int m, const int n, + std::complex* a, const int lda, float* s, std::complex* u, + const int ldu, std::complex* vt, const int ldvt, + std::complex* work, int lwork, float* rwork, int info) { + F77_FUNC_CGESVD(&jobu, &jobvt, &m, &n, a, &lda, s, u, &ldu, vt, &ldvt, work, + &lwork, rwork, &info); +} +template <> int HostLapack >::trtri(const char uplo, const char diag, int n, const std::complex* a, int lda) { @@ -138,6 +195,15 @@ void HostLapack >::gesv(int n, int rhs, F77_FUNC_ZGESV(&n, &rhs, a, &lda, ipiv, b, &ldb, &info); } template <> +void HostLapack >::gesvd( + const char jobu, const char jobvt, const int m, const int n, + std::complex* a, const int lda, double* s, std::complex* u, + const int ldu, std::complex* vt, const int ldvt, + std::complex* work, int lwork, double* rwork, int info) { + F77_FUNC_ZGESVD(&jobu, &jobvt, &m, &n, a, &lda, s, u, &ldu, vt, &ldvt, work, + &lwork, rwork, &info); +} +template <> int HostLapack >::trtri(const char uplo, const char diag, int n, const std::complex* a, diff --git a/lapack/tpls/KokkosLapack_Host_tpl.hpp b/lapack/tpls/KokkosLapack_Host_tpl.hpp index d74099aaec..9eca83afea 100644 --- a/lapack/tpls/KokkosLapack_Host_tpl.hpp +++ b/lapack/tpls/KokkosLapack_Host_tpl.hpp @@ -33,6 +33,12 @@ struct HostLapack { static void gesv(int n, int rhs, T *a, int lda, int *ipiv, T *b, int ldb, int info); + static void gesvd(const char jobu, const char jobvt, const int m, 
const int n, + T *A, const int lda, + typename Kokkos::ArithTraits::mag_type *S, T *U, + const int ldu, T *Vt, const int ldvt, T *work, int lwork, + typename Kokkos::ArithTraits::mag_type *rwork, int info); + static int trtri(const char uplo, const char diag, int n, const T *a, int lda); }; diff --git a/lapack/tpls/KokkosLapack_cusolver.hpp b/lapack/tpls/KokkosLapack_cusolver.hpp new file mode 100644 index 0000000000..006fd68b6f --- /dev/null +++ b/lapack/tpls/KokkosLapack_cusolver.hpp @@ -0,0 +1,92 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSLAPACK_CUSOLVER_HPP_ +#define KOKKOSLAPACK_CUSOLVER_HPP_ + +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSOLVER +#include + +namespace KokkosLapack { +namespace Impl { + +// Declaration of the singleton for cusolver +// this is the only header that needs to be +// included when using cusolverDn. 
+struct CudaLapackSingleton { + cusolverDnHandle_t handle; + + CudaLapackSingleton(); + + static CudaLapackSingleton& singleton(); +}; + +inline void cusolver_internal_error_throw(cusolverStatus_t cusolverStatus, + const char* name, const char* file, + const int line) { + std::ostringstream out; + out << name << " error( "; + switch (cusolverStatus) { + case CUSOLVER_STATUS_NOT_INITIALIZED: + out << "CUSOLVER_STATUS_NOT_INITIALIZED): cusolver handle was not " + "created correctly."; + break; + case CUSOLVER_STATUS_ALLOC_FAILED: + out << "CUSOLVER_STATUS_ALLOC_FAILED): you might tried to allocate too " + "much memory"; + break; + case CUSOLVER_STATUS_INVALID_VALUE: + out << "CUSOLVER_STATUS_INVALID_VALUE)"; + break; + case CUSOLVER_STATUS_ARCH_MISMATCH: + out << "CUSOLVER_STATUS_ARCH_MISMATCH)"; + break; + case CUSOLVER_STATUS_EXECUTION_FAILED: + out << "CUSOLVER_STATUS_EXECUTION_FAILED)"; + break; + case CUSOLVER_STATUS_INTERNAL_ERROR: + out << "CUSOLVER_STATUS_INTERNAL_ERROR)"; + break; + case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED: + out << "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED)"; + break; + default: out << "unrecognized error code): this is bad!"; break; + } + if (file) { + out << " " << file << ":" << line; + } + throw std::runtime_error(out.str()); +} + +inline void cusolver_internal_safe_call(cusolverStatus_t cusolverStatus, + const char* name, + const char* file = nullptr, + const int line = 0) { + if (CUSOLVER_STATUS_SUCCESS != cusolverStatus) { + cusolver_internal_error_throw(cusolverStatus, name, file, line); + } +} + +// The macro below defines is the public interface for the safe cusolver calls. +// The functions themselves are protected by impl namespace. 
+#define KOKKOS_CUSOLVER_SAFE_CALL_IMPL(call) \ + KokkosLapack::Impl::cusolver_internal_safe_call(call, #call, __FILE__, \ + __LINE__) + +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_CUSOLVER +#endif // KOKKOSLAPACK_CUSOLVER_HPP_ diff --git a/lapack/tpls/KokkosLapack_gesv_tpl_spec_avail.hpp b/lapack/tpls/KokkosLapack_gesv_tpl_spec_avail.hpp index a3d8bb6ee9..9fbd299ca5 100644 --- a/lapack/tpls/KokkosLapack_gesv_tpl_spec_avail.hpp +++ b/lapack/tpls/KokkosLapack_gesv_tpl_spec_avail.hpp @@ -20,7 +20,7 @@ namespace KokkosLapack { namespace Impl { // Specialization struct which defines whether a specialization exists -template +template struct gesv_tpl_spec_avail { enum : bool { value = false }; }; @@ -31,9 +31,12 @@ struct gesv_tpl_spec_avail { #define KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_LAPACK(SCALAR, LAYOUT, MEMSPACE) \ template \ struct gesv_tpl_spec_avail< \ + ExecSpace, \ Kokkos::View, \ Kokkos::MemoryTraits >, \ Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ Kokkos::MemoryTraits > > { \ enum : bool { value = true }; \ }; @@ -46,37 +49,29 @@ KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, Kokkos::HostSpace) KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, Kokkos::HostSpace) -/* -#if defined (KOKKOSKERNELS_INST_DOUBLE) \ - && defined (KOKKOSKERNELS_INST_LAYOUTRIGHT) - KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_LAPACK( double, Kokkos::LayoutRight, -Kokkos::HostSpace) #endif -#if defined (KOKKOSKERNELS_INST_FLOAT) \ - && defined (KOKKOSKERNELS_INST_LAYOUTRIGHT) - KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_LAPACK( float, Kokkos::LayoutRight, -Kokkos::HostSpace) #endif -#if defined (KOKKOSKERNELS_INST_KOKKOS_COMPLEX_DOUBLE_) \ - && defined (KOKKOSKERNELS_INST_LAYOUTRIGHT) - KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_LAPACK( Kokkos::complex, -Kokkos::LayoutRight, Kokkos::HostSpace) #endif -#if defined (KOKKOSKERNELS_INST_KOKKOS_COMPLEX_FLOAT_) \ - && defined (KOKKOSKERNELS_INST_LAYOUTRIGHT) - 
KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_LAPACK( Kokkos::complex, -Kokkos::LayoutRight, Kokkos::HostSpace) #endif -*/ #endif +} // namespace Impl +} // namespace KokkosLapack // MAGMA #ifdef KOKKOSKERNELS_ENABLE_TPL_MAGMA +#include "magma_v2.h" -#define KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA(SCALAR, LAYOUT, MEMSPACE) \ - template \ - struct gesv_tpl_spec_avail< \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits > > { \ - enum : bool { value = true }; \ +namespace KokkosLapack { +namespace Impl { +#define KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA(SCALAR, LAYOUT, MEMSPACE) \ + template <> \ + struct gesv_tpl_spec_avail< \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ }; KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA(double, Kokkos::LayoutLeft, @@ -87,28 +82,85 @@ KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA(Kokkos::complex, Kokkos::LayoutLeft, Kokkos::CudaSpace) KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA(Kokkos::complex, Kokkos::LayoutLeft, Kokkos::CudaSpace) +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_MAGMA + +// CUSOLVER +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSOLVER +namespace KokkosLapack { +namespace Impl { + +#define KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(SCALAR, LAYOUT, MEMSPACE) \ + template <> \ + struct gesv_tpl_spec_avail< \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ + }; -/* -#if defined (KOKKOSKERNELS_INST_DOUBLE) \ - && defined (KOKKOSKERNELS_INST_LAYOUTRIGHT) - KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA( double, Kokkos::LayoutRight, -Kokkos::CudaSpace) #endif -#if defined (KOKKOSKERNELS_INST_FLOAT) \ - && defined (KOKKOSKERNELS_INST_LAYOUTRIGHT) - KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA( float, Kokkos::LayoutRight, 
-Kokkos::CudaSpace) #endif -#if defined (KOKKOSKERNELS_INST_KOKKOS_COMPLEX_DOUBLE_) \ - && defined (KOKKOSKERNELS_INST_LAYOUTRIGHT) - KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA( -Kokkos::complex,Kokkos::LayoutRight, Kokkos::CudaSpace) #endif -#if defined (KOKKOSKERNELS_INST_KOKKOS_COMPLEX_FLOAT_) \ - && defined (KOKKOSKERNELS_INST_LAYOUTRIGHT) - KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_MAGMA( Kokkos::complex, -Kokkos::LayoutRight, Kokkos::CudaSpace) #endif -*/ +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(double, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(float, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::CudaSpace) + +#if defined(KOKKOSKERNELS_INST_MEMSPACE_CUDAUVMSPACE) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(double, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(float, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_CUSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) #endif } // namespace Impl } // namespace KokkosLapack +#endif // CUSOLVER + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER +#include + +namespace KokkosLapack { +namespace Impl { + +#define KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_ROCSOLVER(SCALAR, LAYOUT, MEMSPACE) \ + template <> \ + struct gesv_tpl_spec_avail< \ + Kokkos::HIP, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ + }; + +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_ROCSOLVER(double, Kokkos::LayoutLeft, + Kokkos::HIPSpace) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_ROCSOLVER(float, Kokkos::LayoutLeft, + Kokkos::HIPSpace) 
+KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_ROCSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::HIPSpace) +KOKKOSLAPACK_GESV_TPL_SPEC_AVAIL_ROCSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::HIPSpace) + +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER #endif diff --git a/lapack/tpls/KokkosLapack_gesv_tpl_spec_decl.hpp b/lapack/tpls/KokkosLapack_gesv_tpl_spec_decl.hpp index 5846e177d6..41592e079a 100644 --- a/lapack/tpls/KokkosLapack_gesv_tpl_spec_decl.hpp +++ b/lapack/tpls/KokkosLapack_gesv_tpl_spec_decl.hpp @@ -45,229 +45,109 @@ inline void gesv_print_specialization() { namespace KokkosLapack { namespace Impl { -#define KOKKOSLAPACK_DGESV_LAPACK(LAYOUT, MEM_SPACE, ETI_SPEC_AVAIL) \ - template \ - struct GESV< \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - true, ETI_SPEC_AVAIL> { \ - typedef double SCALAR; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - AViewType; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - BViewType; \ - typedef Kokkos::View< \ - int*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - PViewType; \ - \ - static void gesv(const AViewType& A, const BViewType& B, \ - const PViewType& IPIV) { \ - Kokkos::Profiling::pushRegion("KokkosLapack::gesv[TPL_LAPACK,double]"); \ - gesv_print_specialization(); \ - const bool with_pivot = \ - !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); \ - \ - const int N = static_cast(A.extent(1)); \ - const int AST = static_cast(A.stride(1)); \ - const int LDA = (AST == 0) ? 1 : AST; \ - const int BST = static_cast(B.stride(1)); \ - const int LDB = (BST == 0) ? 
1 : BST; \ - const int NRHS = static_cast(B.extent(1)); \ - \ - int info = 0; \ - \ - if (with_pivot) { \ - HostLapack::gesv(N, NRHS, A.data(), LDA, IPIV.data(), \ - B.data(), LDB, info); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; +template +void lapackGesvWrapper(const AViewType& A, const BViewType& B, + const IPIVViewType& IPIV) { + using Scalar = typename AViewType::non_const_value_type; + + const bool with_pivot = !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); + + const int N = static_cast(A.extent(1)); + const int AST = static_cast(A.stride(1)); + const int LDA = (AST == 0) ? 1 : AST; + const int BST = static_cast(B.stride(1)); + const int LDB = (BST == 0) ? 1 : BST; + const int NRHS = static_cast(B.extent(1)); + + int info = 0; + + if (with_pivot) { + if constexpr (Kokkos::ArithTraits::is_complex) { + using MagType = typename Kokkos::ArithTraits::mag_type; + + HostLapack>::gesv( + N, NRHS, reinterpret_cast*>(A.data()), LDA, + IPIV.data(), reinterpret_cast*>(B.data()), LDB, + info); + } else { + HostLapack::gesv(N, NRHS, A.data(), LDA, IPIV.data(), B.data(), + LDB, info); + } + } +} -#define KOKKOSLAPACK_SGESV_LAPACK(LAYOUT, MEM_SPACE, ETI_SPEC_AVAIL) \ - template \ +#define KOKKOSLAPACK_GESV_LAPACK(SCALAR, LAYOUT, EXECSPACE, MEM_SPACE) \ + template <> \ struct GESV< \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - true, ETI_SPEC_AVAIL> { \ - typedef float SCALAR; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - AViewType; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - BViewType; \ - typedef Kokkos::View< \ - int*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - PViewType; \ + EXECSPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, \ + gesv_eti_spec_avail< \ + EXECSPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + 
Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using AViewType = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using BViewType = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using PViewType = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ \ - static void gesv(const AViewType& A, const BViewType& B, \ - const PViewType& IPIV) { \ - Kokkos::Profiling::pushRegion("KokkosLapack::gesv[TPL_LAPACK,float]"); \ + static void gesv(const EXECSPACE& /* space */, const AViewType& A, \ + const BViewType& B, const PViewType& IPIV) { \ + Kokkos::Profiling::pushRegion("KokkosLapack::gesv[TPL_LAPACK," #SCALAR \ + "]"); \ gesv_print_specialization(); \ - const bool with_pivot = \ - !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); \ - \ - const int N = static_cast(A.extent(1)); \ - const int AST = static_cast(A.stride(1)); \ - const int LDA = (AST == 0) ? 1 : AST; \ - const int BST = static_cast(B.stride(1)); \ - const int LDB = (BST == 0) ? 1 : BST; \ - const int NRHS = static_cast(B.extent(1)); \ - \ - int info = 0; \ - \ - if (with_pivot) { \ - HostLapack::gesv(N, NRHS, A.data(), LDA, IPIV.data(), B.data(), \ - LDB, info); \ - } \ + lapackGesvWrapper(A, B, IPIV); \ Kokkos::Profiling::popRegion(); \ } \ }; -#define KOKKOSLAPACK_ZGESV_LAPACK(LAYOUT, MEM_SPACE, ETI_SPEC_AVAIL) \ - template \ - struct GESV**, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View**, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::complex SCALAR; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - AViewType; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - BViewType; \ - typedef Kokkos::View< \ - int*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - PViewType; \ - \ - static void gesv(const AViewType& A, const BViewType& B, \ - const PViewType& IPIV) { \ - Kokkos::Profiling::pushRegion( \ - 
"KokkosLapack::gesv[TPL_LAPACK,complex]"); \ - gesv_print_specialization(); \ - const bool with_pivot = \ - !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); \ - \ - const int N = static_cast(A.extent(1)); \ - const int AST = static_cast(A.stride(1)); \ - const int LDA = (AST == 0) ? 1 : AST; \ - const int BST = static_cast(B.stride(1)); \ - const int LDB = (BST == 0) ? 1 : BST; \ - const int NRHS = static_cast(B.extent(1)); \ - \ - int info = 0; \ - \ - if (with_pivot) { \ - HostLapack >::gesv( \ - N, NRHS, reinterpret_cast*>(A.data()), LDA, \ - IPIV.data(), reinterpret_cast*>(B.data()), \ - LDB, info); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -#define KOKKOSLAPACK_CGESV_LAPACK(LAYOUT, MEM_SPACE, ETI_SPEC_AVAIL) \ - template \ - struct GESV**, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View**, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::complex SCALAR; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - AViewType; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - BViewType; \ - typedef Kokkos::View< \ - int*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - PViewType; \ - \ - static void gesv(const AViewType& A, const BViewType& B, \ - const PViewType& IPIV) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosLapack::gesv[TPL_LAPACK,complex]"); \ - gesv_print_specialization(); \ - const bool with_pivot = \ - !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); \ - \ - const int N = static_cast(A.extent(1)); \ - const int AST = static_cast(A.stride(1)); \ - const int LDA = (AST == 0) ? 1 : AST; \ - const int BST = static_cast(B.stride(1)); \ - const int LDB = (BST == 0) ? 
1 : BST; \ - const int NRHS = static_cast(B.extent(1)); \ - \ - int info = 0; \ - \ - if (with_pivot) { \ - HostLapack >::gesv( \ - N, NRHS, reinterpret_cast*>(A.data()), LDA, \ - IPIV.data(), reinterpret_cast*>(B.data()), \ - LDB, info); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; - -KOKKOSLAPACK_DGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, true) -KOKKOSLAPACK_DGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, false) - -KOKKOSLAPACK_SGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, true) -KOKKOSLAPACK_SGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, false) +#if defined(KOKKOS_ENABLE_SERIAL) +KOKKOSLAPACK_GESV_LAPACK(float, Kokkos::LayoutLeft, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(double, Kokkos::LayoutLeft, Kokkos::Serial, + Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Serial, Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Serial, Kokkos::HostSpace) +#endif -KOKKOSLAPACK_ZGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, true) -KOKKOSLAPACK_ZGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, false) +#if defined(KOKKOS_ENABLE_OPENMP) +KOKKOSLAPACK_GESV_LAPACK(float, Kokkos::LayoutLeft, Kokkos::OpenMP, + Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(double, Kokkos::LayoutLeft, Kokkos::OpenMP, + Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::OpenMP, Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::OpenMP, Kokkos::HostSpace) +#endif -KOKKOSLAPACK_CGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, true) -KOKKOSLAPACK_CGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, false) +#if defined(KOKKOS_ENABLE_THREADS) +KOKKOSLAPACK_GESV_LAPACK(float, Kokkos::LayoutLeft, Kokkos::Threads, + Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(double, Kokkos::LayoutLeft, Kokkos::Threads, + Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(Kokkos::complex, 
Kokkos::LayoutLeft, + Kokkos::Threads, Kokkos::HostSpace) +KOKKOSLAPACK_GESV_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Threads, Kokkos::HostSpace) +#endif } // namespace Impl } // namespace KokkosLapack @@ -275,265 +155,403 @@ KOKKOSLAPACK_CGESV_LAPACK(Kokkos::LayoutLeft, Kokkos::HostSpace, false) // MAGMA #ifdef KOKKOSKERNELS_ENABLE_TPL_MAGMA -#include +#include namespace KokkosLapack { namespace Impl { -#define KOKKOSLAPACK_DGESV_MAGMA(LAYOUT, MEM_SPACE, ETI_SPEC_AVAIL) \ - template \ - struct GESV< \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - true, ETI_SPEC_AVAIL> { \ - typedef double SCALAR; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - AViewType; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - BViewType; \ - typedef Kokkos::View< \ - magma_int_t*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - PViewType; \ - \ - static void gesv(const AViewType& A, const BViewType& B, \ - const PViewType& IPIV) { \ - Kokkos::Profiling::pushRegion("KokkosLapack::gesv[TPL_MAGMA,double]"); \ - gesv_print_specialization(); \ - const bool with_pivot = \ - !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); \ - \ - magma_int_t N = static_cast(A.extent(1)); \ - magma_int_t AST = static_cast(A.stride(1)); \ - magma_int_t LDA = (AST == 0) ? 1 : AST; \ - magma_int_t BST = static_cast(B.stride(1)); \ - magma_int_t LDB = (BST == 0) ? 
1 : BST; \ - magma_int_t NRHS = static_cast(B.extent(1)); \ - \ - KokkosLapack::Impl::MagmaSingleton& s = \ - KokkosLapack::Impl::MagmaSingleton::singleton(); \ - magma_int_t info = 0; \ - \ - if (with_pivot) { \ - magma_dgesv_gpu(N, NRHS, reinterpret_cast(A.data()), \ - LDA, IPIV.data(), \ - reinterpret_cast(B.data()), LDB, \ - &info); \ - } else { \ - magma_dgesv_nopiv_gpu( \ - N, NRHS, reinterpret_cast(A.data()), LDA, \ - reinterpret_cast(B.data()), LDB, &info); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; +template +void magmaGesvWrapper(const ExecSpace& space, const AViewType& A, + const BViewType& B, const IPIVViewType& IPIV) { + using scalar_type = typename AViewType::non_const_value_type; + + Kokkos::Profiling::pushRegion("KokkosLapack::gesv[TPL_MAGMA," + + Kokkos::ArithTraits::name() + "]"); + gesv_print_specialization(); + + const bool with_pivot = !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); + + magma_int_t N = static_cast(A.extent(1)); + magma_int_t AST = static_cast(A.stride(1)); + magma_int_t LDA = (AST == 0) ? 1 : AST; + magma_int_t BST = static_cast(B.stride(1)); + magma_int_t LDB = (BST == 0) ? 
1 : BST; + magma_int_t NRHS = static_cast(B.extent(1)); + + KokkosLapack::Impl::MagmaSingleton& s = + KokkosLapack::Impl::MagmaSingleton::singleton(); + magma_int_t info = 0; + + space.fence(); + if constexpr (std::is_same_v) { + if (with_pivot) { + magma_sgesv_gpu(N, NRHS, reinterpret_cast(A.data()), LDA, + IPIV.data(), reinterpret_cast(B.data()), + LDB, &info); + } else { + magma_sgesv_nopiv_gpu(N, NRHS, reinterpret_cast(A.data()), + LDA, reinterpret_cast(B.data()), + LDB, &info); + } + } -#define KOKKOSLAPACK_SGESV_MAGMA(LAYOUT, MEM_SPACE, ETI_SPEC_AVAIL) \ - template \ + if constexpr (std::is_same_v) { + if (with_pivot) { + magma_dgesv_gpu(N, NRHS, reinterpret_cast(A.data()), LDA, + IPIV.data(), reinterpret_cast(B.data()), + LDB, &info); + } else { + magma_dgesv_nopiv_gpu( + N, NRHS, reinterpret_cast(A.data()), LDA, + reinterpret_cast(B.data()), LDB, &info); + } + } + + if constexpr (std::is_same_v>) { + if (with_pivot) { + magma_cgesv_gpu( + N, NRHS, reinterpret_cast(A.data()), LDA, + IPIV.data(), reinterpret_cast(B.data()), LDB, + &info); + } else { + magma_cgesv_nopiv_gpu( + N, NRHS, reinterpret_cast(A.data()), LDA, + reinterpret_cast(B.data()), LDB, &info); + } + } + + if constexpr (std::is_same_v>) { + if (with_pivot) { + magma_zgesv_gpu( + N, NRHS, reinterpret_cast(A.data()), LDA, + IPIV.data(), reinterpret_cast(B.data()), LDB, + &info); + } else { + magma_zgesv_nopiv_gpu( + N, NRHS, reinterpret_cast(A.data()), LDA, + reinterpret_cast(B.data()), LDB, &info); + } + } + ExecSpace().fence(); + Kokkos::Profiling::popRegion(); +} + +#define KOKKOSLAPACK_GESV_MAGMA(SCALAR, LAYOUT, MEM_SPACE) \ + template <> \ struct GESV< \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ Kokkos::View, \ - Kokkos::MemoryTraits >, \ - true, ETI_SPEC_AVAIL> { \ - typedef float SCALAR; \ - typedef Kokkos::View, \ - 
Kokkos::MemoryTraits > \ - AViewType; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - BViewType; \ - typedef Kokkos::View< \ + Kokkos::MemoryTraits>, \ + true, \ + gesv_eti_spec_avail< \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using AViewType = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using BViewType = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using PViewType = Kokkos::View< \ magma_int_t*, LAYOUT, \ Kokkos::Device, \ - Kokkos::MemoryTraits > \ - PViewType; \ - \ - static void gesv(const AViewType& A, const BViewType& B, \ - const PViewType& IPIV) { \ - Kokkos::Profiling::pushRegion("KokkosLapack::gesv[TPL_MAGMA,float]"); \ - gesv_print_specialization(); \ - const bool with_pivot = \ - !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); \ - \ - magma_int_t N = static_cast(A.extent(1)); \ - magma_int_t AST = static_cast(A.stride(1)); \ - magma_int_t LDA = (AST == 0) ? 1 : AST; \ - magma_int_t BST = static_cast(B.stride(1)); \ - magma_int_t LDB = (BST == 0) ? 
1 : BST; \ - magma_int_t NRHS = static_cast(B.extent(1)); \ - \ - KokkosLapack::Impl::MagmaSingleton& s = \ - KokkosLapack::Impl::MagmaSingleton::singleton(); \ - magma_int_t info = 0; \ + Kokkos::MemoryTraits>; \ \ - if (with_pivot) { \ - magma_sgesv_gpu(N, NRHS, reinterpret_cast(A.data()), \ - LDA, IPIV.data(), \ - reinterpret_cast(B.data()), LDB, \ - &info); \ - } else { \ - magma_sgesv_nopiv_gpu( \ - N, NRHS, reinterpret_cast(A.data()), LDA, \ - reinterpret_cast(B.data()), LDB, &info); \ - } \ - Kokkos::Profiling::popRegion(); \ + static void gesv(const Kokkos::Cuda& space, const AViewType& A, \ + const BViewType& B, const PViewType& IPIV) { \ + magmaGesvWrapper(space, A, B, IPIV); \ } \ }; -#define KOKKOSLAPACK_ZGESV_MAGMA(LAYOUT, MEM_SPACE, ETI_SPEC_AVAIL) \ - template \ - struct GESV**, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View**, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::complex SCALAR; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - AViewType; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - BViewType; \ - typedef Kokkos::View< \ - magma_int_t*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - PViewType; \ - \ - static void gesv(const AViewType& A, const BViewType& B, \ - const PViewType& IPIV) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosLapack::gesv[TPL_MAGMA,complex]"); \ - gesv_print_specialization(); \ - const bool with_pivot = \ - !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); \ - \ - magma_int_t N = static_cast(A.extent(1)); \ - magma_int_t AST = static_cast(A.stride(1)); \ - magma_int_t LDA = (AST == 0) ? 1 : AST; \ - magma_int_t BST = static_cast(B.stride(1)); \ - magma_int_t LDB = (BST == 0) ? 
1 : BST; \ - magma_int_t NRHS = static_cast(B.extent(1)); \ - \ - KokkosLapack::Impl::MagmaSingleton& s = \ - KokkosLapack::Impl::MagmaSingleton::singleton(); \ - magma_int_t info = 0; \ - \ - if (with_pivot) { \ - magma_zgesv_gpu( \ - N, NRHS, reinterpret_cast(A.data()), LDA, \ - IPIV.data(), reinterpret_cast(B.data()), \ - LDB, &info); \ - } else { \ - magma_zgesv_nopiv_gpu( \ - N, NRHS, reinterpret_cast(A.data()), LDA, \ - reinterpret_cast(B.data()), LDB, &info); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ - }; +KOKKOSLAPACK_GESV_MAGMA(float, Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_MAGMA(double, Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_MAGMA(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_MAGMA(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaSpace) -#define KOKKOSLAPACK_CGESV_MAGMA(LAYOUT, MEM_SPACE, ETI_SPEC_AVAIL) \ - template \ - struct GESV**, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View**, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits >, \ - Kokkos::View, \ - Kokkos::MemoryTraits >, \ - true, ETI_SPEC_AVAIL> { \ - typedef Kokkos::complex SCALAR; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - AViewType; \ - typedef Kokkos::View, \ - Kokkos::MemoryTraits > \ - BViewType; \ - typedef Kokkos::View< \ - magma_int_t*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits > \ - PViewType; \ - \ - static void gesv(const AViewType& A, const BViewType& B, \ - const PViewType& IPIV) { \ - Kokkos::Profiling::pushRegion( \ - "KokkosLapack::gesv[TPL_MAGMA,complex]"); \ - gesv_print_specialization(); \ - const bool with_pivot = \ - !((IPIV.extent(0) == 0) && (IPIV.data() == nullptr)); \ - \ - magma_int_t N = static_cast(A.extent(1)); \ - magma_int_t AST = static_cast(A.stride(1)); \ - magma_int_t LDA = (AST == 0) ? 1 : AST; \ - magma_int_t BST = static_cast(B.stride(1)); \ - magma_int_t LDB = (BST == 0) ? 
1 : BST; \ - magma_int_t NRHS = static_cast(B.extent(1)); \ - \ - KokkosLapack::Impl::MagmaSingleton& s = \ - KokkosLapack::Impl::MagmaSingleton::singleton(); \ - magma_int_t info = 0; \ - \ - if (with_pivot) { \ - magma_cgesv_gpu( \ - N, NRHS, reinterpret_cast(A.data()), LDA, \ - IPIV.data(), reinterpret_cast(B.data()), \ - LDB, &info); \ - } else { \ - magma_cgesv_nopiv_gpu( \ - N, NRHS, reinterpret_cast(A.data()), LDA, \ - reinterpret_cast(B.data()), LDB, &info); \ - } \ - Kokkos::Profiling::popRegion(); \ - } \ +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_MAGMA + +// CUSOLVER +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSOLVER +#include "KokkosLapack_cusolver.hpp" + +namespace KokkosLapack { +namespace Impl { + +template +void cusolverGesvWrapper(const ExecutionSpace& space, const IPIVViewType& IPIV, + const AViewType& A, const BViewType& B) { + using memory_space = typename AViewType::memory_space; + using Scalar = typename BViewType::non_const_value_type; + using ALayout_t = typename AViewType::array_layout; + using BLayout_t = typename BViewType::array_layout; + + const int m = A.extent_int(0); + const int n = A.extent_int(1); + const int lda = std::is_same_v ? A.stride(0) + : A.stride(1); + + (void)B; + + const int nrhs = B.extent_int(1); + const int ldb = std::is_same_v ? 
B.stride(0) + : B.stride(1); + int lwork = 0; + Kokkos::View info("getrf info"); + + CudaLapackSingleton& s = CudaLapackSingleton::singleton(); + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnSetStream(s.handle, space.cuda_stream())); + if constexpr (std::is_same_v) { + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnSgetrf_bufferSize(s.handle, m, n, A.data(), lda, &lwork)); + Kokkos::View Workspace("getrf workspace", lwork); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnSgetrf(s.handle, m, n, A.data(), + lda, Workspace.data(), + IPIV.data(), info.data())); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnSgetrs(s.handle, CUBLAS_OP_N, m, nrhs, A.data(), lda, + IPIV.data(), B.data(), ldb, info.data())); + } + if constexpr (std::is_same_v) { + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnDgetrf_bufferSize(s.handle, m, n, A.data(), lda, &lwork)); + Kokkos::View Workspace("getrf workspace", lwork); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnDgetrf(s.handle, m, n, A.data(), + lda, Workspace.data(), + IPIV.data(), info.data())); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnDgetrs(s.handle, CUBLAS_OP_N, m, nrhs, A.data(), lda, + IPIV.data(), B.data(), ldb, info.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnCgetrf_bufferSize( + s.handle, m, n, reinterpret_cast(A.data()), lda, &lwork)); + Kokkos::View Workspace("getrf workspace", lwork); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnCgetrf(s.handle, m, n, reinterpret_cast(A.data()), + lda, reinterpret_cast(Workspace.data()), + IPIV.data(), info.data())); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnCgetrs( + s.handle, CUBLAS_OP_N, m, nrhs, reinterpret_cast(A.data()), + lda, IPIV.data(), reinterpret_cast(B.data()), ldb, + info.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnZgetrf_bufferSize( + s.handle, m, n, reinterpret_cast(A.data()), lda, + &lwork)); + Kokkos::View Workspace("getrf workspace", + lwork); + + 
KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnZgetrf( + s.handle, m, n, reinterpret_cast(A.data()), lda, + reinterpret_cast(Workspace.data()), IPIV.data(), + info.data())); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnZgetrs( + s.handle, CUBLAS_OP_N, m, nrhs, + reinterpret_cast(A.data()), lda, IPIV.data(), + reinterpret_cast(B.data()), ldb, info.data())); + } + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnSetStream(s.handle, NULL)); +} + +#define KOKKOSLAPACK_GESV_CUSOLVER(SCALAR, LAYOUT, MEM_SPACE) \ + template <> \ + struct GESV< \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, \ + gesv_eti_spec_avail< \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using AViewType = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using BViewType = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using PViewType = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + \ + static void gesv(const Kokkos::Cuda& space, const AViewType& A, \ + const BViewType& B, const PViewType& IPIV) { \ + Kokkos::Profiling::pushRegion("KokkosLapack::gesv[TPL_CUSOLVER," #SCALAR \ + "]"); \ + gesv_print_specialization(); \ + \ + cusolverGesvWrapper(space, IPIV, A, B); \ + Kokkos::Profiling::popRegion(); \ + } \ }; -KOKKOSLAPACK_DGESV_MAGMA(Kokkos::LayoutLeft, Kokkos::CudaSpace, true) -KOKKOSLAPACK_DGESV_MAGMA(Kokkos::LayoutLeft, Kokkos::CudaSpace, false) +KOKKOSLAPACK_GESV_CUSOLVER(float, Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_CUSOLVER(double, Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_CUSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSLAPACK_GESV_CUSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaSpace) -KOKKOSLAPACK_SGESV_MAGMA(Kokkos::LayoutLeft, Kokkos::CudaSpace, true) -KOKKOSLAPACK_SGESV_MAGMA(Kokkos::LayoutLeft, 
Kokkos::CudaSpace, false) +#if defined(KOKKOSKERNELS_INST_MEMSPACE_CUDAUVMSPACE) +KOKKOSLAPACK_GESV_CUSOLVER(float, Kokkos::LayoutLeft, Kokkos::CudaUVMSpace) +KOKKOSLAPACK_GESV_CUSOLVER(double, Kokkos::LayoutLeft, Kokkos::CudaUVMSpace) +KOKKOSLAPACK_GESV_CUSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSLAPACK_GESV_CUSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +#endif -KOKKOSLAPACK_ZGESV_MAGMA(Kokkos::LayoutLeft, Kokkos::CudaSpace, true) -KOKKOSLAPACK_ZGESV_MAGMA(Kokkos::LayoutLeft, Kokkos::CudaSpace, false) +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_CUSOLVER -KOKKOSLAPACK_CGESV_MAGMA(Kokkos::LayoutLeft, Kokkos::CudaSpace, true) -KOKKOSLAPACK_CGESV_MAGMA(Kokkos::LayoutLeft, Kokkos::CudaSpace, false) +// ROCSOLVER +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER +#include +#include + +namespace KokkosLapack { +namespace Impl { + +template +void rocsolverGesvWrapper(const ExecutionSpace& space, const IPIVViewType& IPIV, + const AViewType& A, const BViewType& B) { + using Scalar = typename BViewType::non_const_value_type; + using ALayout_t = typename AViewType::array_layout; + using BLayout_t = typename BViewType::array_layout; + + const rocblas_int N = static_cast(A.extent(0)); + const rocblas_int nrhs = static_cast(B.extent(1)); + const rocblas_int lda = std::is_same_v + ? A.stride(0) + : A.stride(1); + const rocblas_int ldb = std::is_same_v + ? 
B.stride(0) + : B.stride(1); + Kokkos::View info("rocsolver info"); + + KokkosBlas::Impl::RocBlasSingleton& s = + KokkosBlas::Impl::RocBlasSingleton::singleton(); + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( + rocblas_set_stream(s.handle, space.hip_stream())); + if constexpr (std::is_same_v) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocsolver_sgesv(s.handle, N, nrhs, A.data(), + lda, IPIV.data(), B.data(), + ldb, info.data())); + } + if constexpr (std::is_same_v) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocsolver_dgesv(s.handle, N, nrhs, A.data(), + lda, IPIV.data(), B.data(), + ldb, info.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocsolver_cgesv( + s.handle, N, nrhs, reinterpret_cast(A.data()), + lda, IPIV.data(), reinterpret_cast(B.data()), + ldb, info.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocsolver_zgesv( + s.handle, N, nrhs, reinterpret_cast(A.data()), + lda, IPIV.data(), reinterpret_cast(B.data()), + ldb, info.data())); + } + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); +} + +#define KOKKOSLAPACK_GESV_ROCSOLVER(SCALAR, LAYOUT, MEM_SPACE) \ + template <> \ + struct GESV< \ + Kokkos::HIP, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, \ + gesv_eti_spec_avail< \ + Kokkos::HIP, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using AViewType = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using BViewType = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using PViewType = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + \ + static void gesv(const Kokkos::HIP& space, const AViewType& A, \ + const BViewType& B, const PViewType& IPIV) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosLapack::gesv[TPL_ROCSOLVER," #SCALAR "]"); \ + gesv_print_specialization(); \ + \ + rocsolverGesvWrapper(space, IPIV, 
A, B); \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +KOKKOSLAPACK_GESV_ROCSOLVER(float, Kokkos::LayoutLeft, Kokkos::HIPSpace) +KOKKOSLAPACK_GESV_ROCSOLVER(double, Kokkos::LayoutLeft, Kokkos::HIPSpace) +KOKKOSLAPACK_GESV_ROCSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::HIPSpace) +KOKKOSLAPACK_GESV_ROCSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::HIPSpace) } // namespace Impl } // namespace KokkosLapack -#endif // KOKKOSKERNELS_ENABLE_TPL_MAGMA +#endif // KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER #endif diff --git a/lapack/tpls/KokkosLapack_magma.hpp b/lapack/tpls/KokkosLapack_magma.hpp index 66529d73de..dfde113fa6 100644 --- a/lapack/tpls/KokkosLapack_magma.hpp +++ b/lapack/tpls/KokkosLapack_magma.hpp @@ -16,13 +16,16 @@ #ifndef KOKKOSLAPACK_MAGMA_HPP_ #define KOKKOSLAPACK_MAGMA_HPP_ -// If LAPACK TPL is enabled, it is preferred over magma's LAPACK + #ifdef KOKKOSKERNELS_ENABLE_TPL_MAGMA #include "magma_v2.h" namespace KokkosLapack { namespace Impl { +// Declaration of the singleton for cusolver +// this is the only header that needs to be +// included when using cusolverDn. struct MagmaSingleton { MagmaSingleton(); @@ -31,5 +34,6 @@ struct MagmaSingleton { } // namespace Impl } // namespace KokkosLapack -#endif // KOKKOSKERNELS_ENABLE_TPL_MAGMA +#endif + #endif // KOKKOSLAPACK_MAGMA_HPP_ diff --git a/lapack/tpls/KokkosLapack_svd_tpl_spec_avail.hpp b/lapack/tpls/KokkosLapack_svd_tpl_spec_avail.hpp new file mode 100644 index 0000000000..7a7403209f --- /dev/null +++ b/lapack/tpls/KokkosLapack_svd_tpl_spec_avail.hpp @@ -0,0 +1,171 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. 
+// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_HPP_ +#define KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_HPP_ + +namespace KokkosLapack { +namespace Impl { +// Specialization struct which defines whether a specialization exists +template +struct svd_tpl_spec_avail { + enum : bool { value = false }; +}; + +// LAPACK +#if defined(KOKKOSKERNELS_ENABLE_TPL_LAPACK) || \ + defined(KOKKOSKERNELS_ENABLE_TPL_MKL) +#define KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(SCALAR, LAYOUT, EXECSPACE) \ + template <> \ + struct svd_tpl_spec_avail< \ + EXECSPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>> { \ + enum : bool { value = true }; \ + }; + +#if defined(KOKKOS_ENABLE_SERIAL) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(float, Kokkos::LayoutLeft, + Kokkos::Serial) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(double, Kokkos::LayoutLeft, + Kokkos::Serial) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Serial) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Serial) +#endif + +#if defined(KOKKOS_ENABLE_OPENMP) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(float, Kokkos::LayoutLeft, + Kokkos::OpenMP) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(double, Kokkos::LayoutLeft, + Kokkos::OpenMP) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::OpenMP) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::OpenMP) +#endif + +#if defined(KOKKOS_ENABLE_THREADS) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(float, Kokkos::LayoutLeft, + Kokkos::Threads) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(double, Kokkos::LayoutLeft, + Kokkos::Threads) 
+KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Threads) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_LAPACK(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::Threads) +#endif + +#endif // KOKKOSKERNELS_ENABLE_TPL_LAPACK || KOKKOSKERNELS_ENABLE_TPL_MKL + +// CUSOLVER +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSOLVER +#define KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(SCALAR, LAYOUT, MEMSPACE) \ + template <> \ + struct svd_tpl_spec_avail< \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>> { \ + enum : bool { value = true }; \ + }; + +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(float, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(double, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::CudaSpace) + +#if defined(KOKKOSKERNELS_INST_MEMSPACE_CUDAUVMSPACE) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(float, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(double, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_CUSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +#endif // CUDAUVMSPACE +#endif // CUSOLVER + +// ROCSOLVER +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER +#define KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(SCALAR, LAYOUT, MEMSPACE) \ + template <> \ + struct svd_tpl_spec_avail< \ + Kokkos::HIP, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + 
Kokkos::View, \ + Kokkos::MemoryTraits>> { \ + enum : bool { value = true }; \ + }; + +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(float, Kokkos::LayoutLeft, + Kokkos::HIPSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(double, Kokkos::LayoutLeft, + Kokkos::HIPSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::HIPSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, Kokkos::HIPSpace) + +#if defined(KOKKOSKERNELS_INST_MEMSPACE_HIPMANAGEDSPACE) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(float, Kokkos::LayoutLeft, + Kokkos::HIPManagedSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(double, Kokkos::LayoutLeft, + Kokkos::HIPManagedSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, + Kokkos::HIPManagedSpace) +KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_ROCSOLVER(Kokkos::complex, + Kokkos::LayoutLeft, + Kokkos::HIPManagedSpace) +#endif // HIPMANAGEDSPACE +#endif // ROCSOLVER + +} // namespace Impl +} // namespace KokkosLapack + +#endif // KOKKOSLAPACK_SVD_TPL_SPEC_AVAIL_HPP_ diff --git a/lapack/tpls/KokkosLapack_svd_tpl_spec_decl.hpp b/lapack/tpls/KokkosLapack_svd_tpl_spec_decl.hpp new file mode 100644 index 0000000000..4385fa40d6 --- /dev/null +++ b/lapack/tpls/KokkosLapack_svd_tpl_spec_decl.hpp @@ -0,0 +1,688 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSLAPACK_SVD_TPL_SPEC_DECL_HPP_ +#define KOKKOSLAPACK_SVD_TPL_SPEC_DECL_HPP_ + +#include "KokkosKernels_Error.hpp" +#include "Kokkos_ArithTraits.hpp" + +namespace KokkosLapack { +namespace Impl { +template +inline void svd_print_specialization() { +#ifdef KOKKOSKERNELS_ENABLE_CHECK_SPECIALIZATION +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSOLVER + if constexpr (std::is_same_v) { + printf( + "KokkosLapack::svd<> TPL Cusolver specialization for < %s , %s, %s, %s " + ">\n", + typeid(AMatrix).name(), typeid(SVector).name(), typeid(UMatrix).name(), + typeid(VMatrix).name()); + } +#endif +#endif +} +} // namespace Impl +} // namespace KokkosLapack + +// LAPACK +#if defined(KOKKOSKERNELS_ENABLE_TPL_LAPACK) && \ + !defined(KOKKOSKERNELS_ENABLE_TPL_MKL) +#include "KokkosLapack_Host_tpl.hpp" + +namespace KokkosLapack { +namespace Impl { + +template +void lapackSvdWrapper(const ExecutionSpace& /* space */, const char jobu[], + const char jobvt[], const AMatrix& A, const SVector& S, + const UMatrix& U, const VMatrix& Vt) { + using memory_space = typename AMatrix::memory_space; + using Scalar = typename AMatrix::non_const_value_type; + using Magnitude = typename SVector::non_const_value_type; + using ALayout_t = typename AMatrix::array_layout; + using ULayout_t = typename UMatrix::array_layout; + using VLayout_t = typename VMatrix::array_layout; + + static_assert(std::is_same_v, + "KokkosLapack - svd: A needs to have a Kokkos::LayoutLeft"); + static_assert(std::is_same_v, + "KokkosLapack - svd: U needs to have a Kokkos::LayoutLeft"); + static_assert(std::is_same_v, + "KokkosLapack - svd: Vt needs to have a Kokkos::LayoutLeft"); + + const int m = A.extent_int(0); + const int n = A.extent_int(1); + const int lda = A.stride(1); + const int ldu = U.stride(1); + const int ldvt = Vt.stride(1); + + int lwork = -1, info = 0; + Kokkos::View rwork("svd rwork buffer", + 5 * Kokkos::min(m, n)); + Kokkos::View 
work("svd work buffer", 1); + if constexpr (Kokkos::ArithTraits::is_complex) { + HostLapack>::gesvd( + jobu[0], jobvt[0], m, n, + reinterpret_cast*>(A.data()), lda, S.data(), + reinterpret_cast*>(U.data()), ldu, + reinterpret_cast*>(Vt.data()), ldvt, + reinterpret_cast*>(work.data()), lwork, + rwork.data(), info); + + lwork = static_cast(work(0).real()); + + work = Kokkos::View("svd work buffer", lwork); + HostLapack>::gesvd( + jobu[0], jobvt[0], m, n, + reinterpret_cast*>(A.data()), lda, S.data(), + reinterpret_cast*>(U.data()), ldu, + reinterpret_cast*>(Vt.data()), ldvt, + reinterpret_cast*>(work.data()), lwork, + rwork.data(), info); + } else { + HostLapack::gesvd(jobu[0], jobvt[0], m, n, A.data(), lda, S.data(), + U.data(), ldu, Vt.data(), ldvt, work.data(), + lwork, rwork.data(), info); + + lwork = static_cast(work(0)); + + work = Kokkos::View("svd work buffer", lwork); + HostLapack::gesvd(jobu[0], jobvt[0], m, n, A.data(), lda, S.data(), + U.data(), ldu, Vt.data(), ldvt, work.data(), + lwork, rwork.data(), info); + } +} + +#define KOKKOSLAPACK_SVD_LAPACK(SCALAR, LAYOUT, EXEC_SPACE) \ + template <> \ + struct SVD< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, \ + svd_eti_spec_avail< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using AMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using SVector = \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>; \ + using UMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using VMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + \ + static void svd(const EXEC_SPACE& space, 
const char jobu[], \ + const char jobvt[], const AMatrix& A, const SVector& S, \ + const UMatrix& U, const VMatrix& Vt) { \ + Kokkos::Profiling::pushRegion("KokkosLapack::svd[TPL_LAPACK," #SCALAR \ + "]"); \ + svd_print_specialization(); \ + \ + lapackSvdWrapper(space, jobu, jobvt, A, S, U, Vt); \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#if defined(KOKKOS_ENABLE_SERIAL) +KOKKOSLAPACK_SVD_LAPACK(float, Kokkos::LayoutLeft, Kokkos::Serial) +KOKKOSLAPACK_SVD_LAPACK(double, Kokkos::LayoutLeft, Kokkos::Serial) +KOKKOSLAPACK_SVD_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Serial) +KOKKOSLAPACK_SVD_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Serial) +#endif + +#if defined(KOKKOS_ENABLE_OPENMP) +KOKKOSLAPACK_SVD_LAPACK(float, Kokkos::LayoutLeft, Kokkos::OpenMP) +KOKKOSLAPACK_SVD_LAPACK(double, Kokkos::LayoutLeft, Kokkos::OpenMP) +KOKKOSLAPACK_SVD_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::OpenMP) +KOKKOSLAPACK_SVD_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::OpenMP) +#endif + +#if defined(KOKKOS_ENABLE_THREADS) +KOKKOSLAPACK_SVD_LAPACK(float, Kokkos::LayoutLeft, Kokkos::Threads) +KOKKOSLAPACK_SVD_LAPACK(double, Kokkos::LayoutLeft, Kokkos::Threads) +KOKKOSLAPACK_SVD_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Threads) +KOKKOSLAPACK_SVD_LAPACK(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Threads) +#endif + +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_LAPACK + +#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL +#include "mkl.h" + +namespace KokkosLapack { +namespace Impl { + +template +void mklSvdWrapper(const ExecutionSpace& /* space */, const char jobu[], + const char jobvt[], const AMatrix& A, const SVector& S, + const UMatrix& U, const VMatrix& Vt) { + using memory_space = typename AMatrix::memory_space; + using Scalar = typename AMatrix::non_const_value_type; + using Magnitude = typename SVector::non_const_value_type; + using ALayout_t = typename AMatrix::array_layout; + using 
ULayout_t = typename UMatrix::array_layout; + using VLayout_t = typename VMatrix::array_layout; + + static_assert(std::is_same_v, + "KokkosLapack - svd: A needs to have a Kokkos::LayoutLeft"); + static_assert(std::is_same_v, + "KokkosLapack - svd: U needs to have a Kokkos::LayoutLeft"); + static_assert(std::is_same_v, + "KokkosLapack - svd: Vt needs to have a Kokkos::LayoutLeft"); + + const lapack_int m = A.extent_int(0); + const lapack_int n = A.extent_int(1); + const lapack_int lda = A.stride(1); + const lapack_int ldu = U.stride(1); + const lapack_int ldvt = Vt.stride(1); + + Kokkos::View rwork("svd rwork buffer", + Kokkos::min(m, n) - 1); + lapack_int ret = 0; + if constexpr (std::is_same_v) { + ret = + LAPACKE_sgesvd(LAPACK_COL_MAJOR, jobu[0], jobvt[0], m, n, A.data(), lda, + S.data(), U.data(), ldu, Vt.data(), ldvt, rwork.data()); + } + if constexpr (std::is_same_v) { + ret = + LAPACKE_dgesvd(LAPACK_COL_MAJOR, jobu[0], jobvt[0], m, n, A.data(), lda, + S.data(), U.data(), ldu, Vt.data(), ldvt, rwork.data()); + } + if constexpr (std::is_same_v>) { + ret = LAPACKE_cgesvd( + LAPACK_COL_MAJOR, jobu[0], jobvt[0], m, n, + reinterpret_cast(A.data()), lda, S.data(), + reinterpret_cast(U.data()), ldu, + reinterpret_cast(Vt.data()), ldvt, rwork.data()); + } + if constexpr (std::is_same_v>) { + ret = LAPACKE_zgesvd( + LAPACK_COL_MAJOR, jobu[0], jobvt[0], m, n, + reinterpret_cast(A.data()), lda, S.data(), + reinterpret_cast(U.data()), ldu, + reinterpret_cast(Vt.data()), ldvt, + rwork.data()); + } + + if (ret != 0) { + std::ostringstream os; + os << "KokkosLapack::svd: MKL failed with return value: " << ret << "\n"; + KokkosKernels::Impl::throw_runtime_exception(os.str()); + } +} + +#define KOKKOSLAPACK_SVD_MKL(SCALAR, LAYOUT, EXEC_SPACE) \ + template <> \ + struct SVD< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + 
Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, \ + svd_eti_spec_avail< \ + EXEC_SPACE, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using AMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using SVector = \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>; \ + using UMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using VMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + \ + static void svd(const EXEC_SPACE& space, const char jobu[], \ + const char jobvt[], const AMatrix& A, const SVector& S, \ + const UMatrix& U, const VMatrix& Vt) { \ + Kokkos::Profiling::pushRegion("KokkosLapack::svd[TPL_LAPACK," #SCALAR \ + "]"); \ + svd_print_specialization(); \ + \ + mklSvdWrapper(space, jobu, jobvt, A, S, U, Vt); \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#if defined(KOKKOS_ENABLE_SERIAL) +KOKKOSLAPACK_SVD_MKL(float, Kokkos::LayoutLeft, Kokkos::Serial) +KOKKOSLAPACK_SVD_MKL(double, Kokkos::LayoutLeft, Kokkos::Serial) +KOKKOSLAPACK_SVD_MKL(Kokkos::complex, Kokkos::LayoutLeft, Kokkos::Serial) +KOKKOSLAPACK_SVD_MKL(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Serial) +#endif + +#if defined(KOKKOS_ENABLE_OPENMP) +KOKKOSLAPACK_SVD_MKL(float, Kokkos::LayoutLeft, Kokkos::OpenMP) +KOKKOSLAPACK_SVD_MKL(double, Kokkos::LayoutLeft, Kokkos::OpenMP) +KOKKOSLAPACK_SVD_MKL(Kokkos::complex, Kokkos::LayoutLeft, Kokkos::OpenMP) +KOKKOSLAPACK_SVD_MKL(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::OpenMP) +#endif + +#if defined(KOKKOS_ENABLE_THREADS) +KOKKOSLAPACK_SVD_MKL(float, Kokkos::LayoutLeft, Kokkos::Threads) +KOKKOSLAPACK_SVD_MKL(double, Kokkos::LayoutLeft, Kokkos::Threads) +KOKKOSLAPACK_SVD_MKL(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Threads) +KOKKOSLAPACK_SVD_MKL(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::Threads) 
+#endif + +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_MKL + +// CUSOLVER +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSOLVER +#include "KokkosLapack_cusolver.hpp" + +namespace KokkosLapack { +namespace Impl { + +template +void cusolverSvdWrapper(const ExecutionSpace& space, const char jobu[], + const char jobvt[], const AMatrix& A, const SVector& S, + const UMatrix& U, const VMatrix& Vt) { + using memory_space = typename AMatrix::memory_space; + using Scalar = typename AMatrix::non_const_value_type; + using Magnitude = typename SVector::non_const_value_type; + using ALayout_t = typename AMatrix::array_layout; + using ULayout_t = typename UMatrix::array_layout; + using VLayout_t = typename VMatrix::array_layout; + + static_assert(std::is_same_v, + "KokkosLapack - svd: A needs to have a Kokkos::LayoutLeft"); + static_assert(std::is_same_v, + "KokkosLapack - svd: U needs to have a Kokkos::LayoutLeft"); + static_assert(std::is_same_v, + "KokkosLapack - svd: Vt needs to have a Kokkos::LayoutLeft"); + + const int m = A.extent_int(0); + const int n = A.extent_int(1); + const int lda = A.stride(1); + const int ldu = U.stride(1); + const int ldvt = Vt.stride(1); + + int lwork = 0; + Kokkos::View info("svd info"); + Kokkos::View rwork("svd rwork buffer", + Kokkos::min(m, n) - 1); + + CudaLapackSingleton& s = CudaLapackSingleton::singleton(); + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnSetStream(s.handle, space.cuda_stream())); + if constexpr (std::is_same_v) { + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnSgesvd_bufferSize(s.handle, m, n, &lwork)); + Kokkos::View work("svd work buffer", lwork); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnSgesvd( + s.handle, jobu[0], jobvt[0], m, n, A.data(), lda, S.data(), U.data(), + ldu, Vt.data(), ldvt, work.data(), lwork, rwork.data(), info.data())); + } + if constexpr (std::is_same_v) { + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnDgesvd_bufferSize(s.handle, m, n, &lwork)); + Kokkos::View work("svd 
work buffer", lwork); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnDgesvd( + s.handle, jobu[0], jobvt[0], m, n, A.data(), lda, S.data(), U.data(), + ldu, Vt.data(), ldvt, work.data(), lwork, rwork.data(), info.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnCgesvd_bufferSize(s.handle, m, n, &lwork)); + Kokkos::View work("svd work buffer", lwork); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnCgesvd(s.handle, jobu[0], jobvt[0], m, n, + reinterpret_cast(A.data()), lda, S.data(), + reinterpret_cast(U.data()), ldu, + reinterpret_cast(Vt.data()), ldvt, + reinterpret_cast(work.data()), lwork, + rwork.data(), info.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnZgesvd_bufferSize(s.handle, m, n, &lwork)); + Kokkos::View work("svd work buffer", lwork); + + KOKKOS_CUSOLVER_SAFE_CALL_IMPL( + cusolverDnZgesvd(s.handle, jobu[0], jobvt[0], m, n, + reinterpret_cast(A.data()), lda, + S.data(), reinterpret_cast(U.data()), + ldu, reinterpret_cast(Vt.data()), + ldvt, reinterpret_cast(work.data()), + lwork, rwork.data(), info.data())); + } + KOKKOS_CUSOLVER_SAFE_CALL_IMPL(cusolverDnSetStream(s.handle, NULL)); +} + +#define KOKKOSLAPACK_SVD_CUSOLVER(SCALAR, LAYOUT, MEM_SPACE) \ + template <> \ + struct SVD< \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, \ + svd_eti_spec_avail< \ + Kokkos::Cuda, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using AMatrix = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using SVector = \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>; \ + using 
UMatrix = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using VMatrix = Kokkos::View, \ + Kokkos::MemoryTraits>; \ + \ + static void svd(const Kokkos::Cuda& space, const char jobu[], \ + const char jobvt[], const AMatrix& A, const SVector& S, \ + const UMatrix& U, const VMatrix& Vt) { \ + Kokkos::Profiling::pushRegion("KokkosLapack::svd[TPL_CUSOLVER," #SCALAR \ + "]"); \ + svd_print_specialization(); \ + \ + cusolverSvdWrapper(space, jobu, jobvt, A, S, U, Vt); \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +KOKKOSLAPACK_SVD_CUSOLVER(float, Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSLAPACK_SVD_CUSOLVER(double, Kokkos::LayoutLeft, Kokkos::CudaSpace) +KOKKOSLAPACK_SVD_CUSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaSpace) +KOKKOSLAPACK_SVD_CUSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaSpace) + +#if defined(KOKKOSKERNELS_INST_MEMSPACE_CUDAUVMSPACE) +KOKKOSLAPACK_SVD_CUSOLVER(float, Kokkos::LayoutLeft, Kokkos::CudaUVMSpace) +KOKKOSLAPACK_SVD_CUSOLVER(double, Kokkos::LayoutLeft, Kokkos::CudaUVMSpace) +KOKKOSLAPACK_SVD_CUSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +KOKKOSLAPACK_SVD_CUSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::CudaUVMSpace) +#endif + +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_CUSOLVER + +// ROCSOLVER +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER +#include +#include + +namespace KokkosLapack { +namespace Impl { + +template +void rocsolverSvdWrapper(const ExecutionSpace& space, const char jobu[], + const char jobvt[], const AMatrix& A, const SVector& S, + const UMatrix& U, const VMatrix& Vt) { + using memory_space = typename AMatrix::memory_space; + using Scalar = typename AMatrix::non_const_value_type; + using Magnitude = typename SVector::non_const_value_type; + using ALayout_t = typename AMatrix::array_layout; + using ULayout_t = typename UMatrix::array_layout; + using VLayout_t = typename VMatrix::array_layout; + + static_assert(std::is_same_v, + 
"KokkosLapack - svd: A needs to have a Kokkos::LayoutLeft"); + static_assert(std::is_same_v, + "KokkosLapack - svd: U needs to have a Kokkos::LayoutLeft"); + static_assert(std::is_same_v, + "KokkosLapack - svd: Vt needs to have a Kokkos::LayoutLeft"); + + const rocblas_int m = A.extent_int(0); + const rocblas_int n = A.extent_int(1); + const rocblas_int lda = A.stride(1); + const rocblas_int ldu = U.stride(1); + const rocblas_int ldvt = Vt.stride(1); + + rocblas_svect UVecMode = rocblas_svect_all; + if ((jobu[0] == 'S') || (jobu[0] == 's')) { + UVecMode = rocblas_svect_singular; + } else if ((jobu[0] == 'O') || (jobu[0] == 'o')) { + UVecMode = rocblas_svect_overwrite; + } else if ((jobu[0] == 'N') || (jobu[0] == 'n')) { + UVecMode = rocblas_svect_none; + } + rocblas_svect VVecMode = rocblas_svect_all; + if ((jobvt[0] == 'S') || (jobvt[0] == 's')) { + VVecMode = rocblas_svect_singular; + } else if ((jobvt[0] == 'O') || (jobvt[0] == 'o')) { + VVecMode = rocblas_svect_overwrite; + } else if ((jobvt[0] == 'N') || (jobvt[0] == 'n')) { + VVecMode = rocblas_svect_none; + } + + const rocblas_workmode WorkMode = rocblas_outofplace; + + Kokkos::View info("svd info"); + Kokkos::View rwork("svd rwork buffer", + Kokkos::min(m, n) - 1); + + KokkosBlas::Impl::RocBlasSingleton& s = + KokkosBlas::Impl::RocBlasSingleton::singleton(); + KOKKOS_ROCBLAS_SAFE_CALL_IMPL( + rocblas_set_stream(s.handle, space.hip_stream())); + if constexpr (std::is_same_v) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocsolver_sgesvd( + s.handle, UVecMode, VVecMode, m, n, A.data(), lda, S.data(), U.data(), + ldu, Vt.data(), ldvt, rwork.data(), WorkMode, info.data())); + } + if constexpr (std::is_same_v) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocsolver_dgesvd( + s.handle, UVecMode, VVecMode, m, n, A.data(), lda, S.data(), U.data(), + ldu, Vt.data(), ldvt, rwork.data(), WorkMode, info.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocsolver_cgesvd( + s.handle, UVecMode, VVecMode, m, n, + 
reinterpret_cast(A.data()), lda, S.data(), + reinterpret_cast(U.data()), ldu, + reinterpret_cast(Vt.data()), ldvt, rwork.data(), + WorkMode, info.data())); + } + if constexpr (std::is_same_v>) { + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocsolver_zgesvd( + s.handle, UVecMode, VVecMode, m, n, + reinterpret_cast(A.data()), lda, S.data(), + reinterpret_cast(U.data()), ldu, + reinterpret_cast(Vt.data()), ldvt, + rwork.data(), WorkMode, info.data())); + } + KOKKOS_ROCBLAS_SAFE_CALL_IMPL(rocblas_set_stream(s.handle, NULL)); +} + +#define KOKKOSLAPACK_SVD_ROCSOLVER(SCALAR, LAYOUT, MEM_SPACE) \ + template <> \ + struct SVD< \ + Kokkos::HIP, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, \ + svd_eti_spec_avail< \ + Kokkos::HIP, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>>::value> { \ + using AMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using SVector = \ + Kokkos::View::mag_type*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>; \ + using UMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using VMatrix = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + \ + static void svd(const Kokkos::HIP& space, const char jobu[], \ + const char jobvt[], const AMatrix& A, const SVector& S, \ + const UMatrix& U, const VMatrix& Vt) { \ + Kokkos::Profiling::pushRegion("KokkosLapack::svd[TPL_ROCSOLVER," #SCALAR \ + "]"); \ + svd_print_specialization(); \ + \ + rocsolverSvdWrapper(space, jobu, jobvt, A, S, U, Vt); \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +KOKKOSLAPACK_SVD_ROCSOLVER(float, Kokkos::LayoutLeft, Kokkos::HIPSpace) +KOKKOSLAPACK_SVD_ROCSOLVER(double, Kokkos::LayoutLeft, Kokkos::HIPSpace) 
+KOKKOSLAPACK_SVD_ROCSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::HIPSpace) +KOKKOSLAPACK_SVD_ROCSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::HIPSpace) + +#if defined(KOKKOSKERNELS_INST_MEMSPACE_HIPMANAGEDSPACE) +KOKKOSLAPACK_SVD_ROCSOLVER(float, Kokkos::LayoutLeft, Kokkos::HIPManagedSpace) +KOKKOSLAPACK_SVD_ROCSOLVER(double, Kokkos::LayoutLeft, Kokkos::HIPManagedSpace) +KOKKOSLAPACK_SVD_ROCSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::HIPManagedSpace) +KOKKOSLAPACK_SVD_ROCSOLVER(Kokkos::complex, Kokkos::LayoutLeft, + Kokkos::HIPManagedSpace) +#endif + +} // namespace Impl +} // namespace KokkosLapack +#endif // KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER + +#endif // KOKKOSLAPACK_SVD_TPL_SPEC_DECL_HPP_ diff --git a/lapack/tpls/KokkosLapack_trtri_tpl_spec_decl.hpp b/lapack/tpls/KokkosLapack_trtri_tpl_spec_decl.hpp index 655b5b8579..b7e9c6e341 100644 --- a/lapack/tpls/KokkosLapack_trtri_tpl_spec_decl.hpp +++ b/lapack/tpls/KokkosLapack_trtri_tpl_spec_decl.hpp @@ -18,6 +18,7 @@ #define KOKKOSLAPACK_TRTRI_TPL_SPEC_DECL_HPP_ #include "KokkosLapack_Host_tpl.hpp" // trtri prototype + #ifdef KOKKOSKERNELS_ENABLE_TPL_MAGMA #include "KokkosLapack_magma.hpp" #endif diff --git a/lapack/unit_test/Test_Lapack.hpp b/lapack/unit_test/Test_Lapack.hpp index 815c442884..1a717521f8 100644 --- a/lapack/unit_test/Test_Lapack.hpp +++ b/lapack/unit_test/Test_Lapack.hpp @@ -18,5 +18,6 @@ #include "Test_Lapack_gesv.hpp" #include "Test_Lapack_trtri.hpp" +#include "Test_Lapack_svd.hpp" #endif // TEST_LAPACK_HPP diff --git a/lapack/unit_test/Test_Lapack_gesv.hpp b/lapack/unit_test/Test_Lapack_gesv.hpp index 06f51b7eb0..77774d1d3f 100644 --- a/lapack/unit_test/Test_Lapack_gesv.hpp +++ b/lapack/unit_test/Test_Lapack_gesv.hpp @@ -15,13 +15,15 @@ //@HEADER // only enable this test where KokkosLapack supports gesv: -// CUDA+MAGMA and HOST+LAPACK -#if (defined(TEST_CUDA_LAPACK_CPP) && \ - defined(KOKKOSKERNELS_ENABLE_TPL_MAGMA)) || \ - (defined(KOKKOSKERNELS_ENABLE_TPL_LAPACK) && \ 
- (defined(TEST_OPENMP_LAPACK_CPP) || \ - defined(TEST_OPENMPTARGET_LAPACK_CPP) || \ - defined(TEST_SERIAL_LAPACK_CPP) || defined(TEST_THREADS_LAPACK_CPP))) +// CUDA+(MAGMA or CUSOLVER), HIP+ROCSOLVER and HOST+LAPACK +#if (defined(TEST_CUDA_LAPACK_CPP) && \ + (defined(KOKKOSKERNELS_ENABLE_TPL_MAGMA) || \ + defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER))) || \ + (defined(TEST_HIP_LAPACK_CPP) && \ + defined(KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER)) || \ + (defined(KOKKOSKERNELS_ENABLE_TPL_LAPACK) && \ + (defined(TEST_OPENMP_LAPACK_CPP) || defined(TEST_SERIAL_LAPACK_CPP) || \ + defined(TEST_THREADS_LAPACK_CPP))) #include #include @@ -34,11 +36,13 @@ namespace Test { -template +template void impl_test_gesv(const char* mode, const char* padding, int N) { - typedef typename Device::execution_space execution_space; - typedef typename ViewTypeA::value_type ScalarA; - typedef Kokkos::ArithTraits ats; + using execution_space = typename Device::execution_space; + using ScalarA = typename ViewTypeA::value_type; + using ats = Kokkos::ArithTraits; + + execution_space space{}; Kokkos::Random_XorShift64_Pool rand_pool(13718); @@ -80,7 +84,9 @@ void impl_test_gesv(const char* mode, const char* padding, int N) { Kokkos::deep_copy(h_X0, X0); // Allocate IPIV view on host - typedef Kokkos::View ViewTypeP; + using ViewTypeP = typename std::conditional< + MAGMA, Kokkos::View, + Kokkos::View>::type; ViewTypeP ipiv; int Nt = 0; if (mode[0] == 'Y') { @@ -90,7 +96,7 @@ void impl_test_gesv(const char* mode, const char* padding, int N) { // Solve. 
try { - KokkosLapack::gesv(A, B, ipiv); + KokkosLapack::gesv(space, A, B, ipiv); } catch (const std::runtime_error& error) { // Check for expected runtime errors due to: // no-pivoting case (note: only MAGMA supports no-pivoting interface) @@ -124,26 +130,30 @@ void impl_test_gesv(const char* mode, const char* padding, int N) { // Checking vs ref on CPU, this eps is about 10^-9 typedef typename ats::mag_type mag_type; - const mag_type eps = 1.0e7 * ats::epsilon(); + const mag_type eps = 3.0e7 * ats::epsilon(); bool test_flag = true; for (int i = 0; i < N; i++) { if (ats::abs(h_B(i) - h_X0(i)) > eps) { test_flag = false; - // printf( " Error %d, pivot %c, padding %c: result( %.15lf ) != - // solution( %.15lf ) at (%d)\n", N, mode[0], padding[0], - // ats::abs(h_B(i)), ats::abs(h_X0(i)), int(i) ); - // break; + printf( + " Error %d, pivot %c, padding %c: result( %.15lf ) !=" + "solution( %.15lf ) at (%d), error=%.15e, eps=%.15e\n", + N, mode[0], padding[0], ats::abs(h_B(i)), ats::abs(h_X0(i)), int(i), + ats::abs(h_B(i) - h_X0(i)), eps); + break; } } ASSERT_EQ(test_flag, true); } -template +template void impl_test_gesv_mrhs(const char* mode, const char* padding, int N, int nrhs) { - typedef typename Device::execution_space execution_space; - typedef typename ViewTypeA::value_type ScalarA; - typedef Kokkos::ArithTraits ats; + using execution_space = typename Device::execution_space; + using ScalarA = typename ViewTypeA::value_type; + using ats = Kokkos::ArithTraits; + + execution_space space{}; Kokkos::Random_XorShift64_Pool rand_pool(13718); @@ -185,7 +195,9 @@ void impl_test_gesv_mrhs(const char* mode, const char* padding, int N, Kokkos::deep_copy(h_X0, X0); // Allocate IPIV view on host - typedef Kokkos::View ViewTypeP; + using ViewTypeP = typename std::conditional< + MAGMA, Kokkos::View, + Kokkos::View>::type; ViewTypeP ipiv; int Nt = 0; if (mode[0] == 'Y') { @@ -195,7 +207,7 @@ void impl_test_gesv_mrhs(const char* mode, const char* padding, int N, // Solve. 
try { - KokkosLapack::gesv(A, B, ipiv); + KokkosLapack::gesv(space, A, B, ipiv); } catch (const std::runtime_error& error) { // Check for expected runtime errors due to: // no-pivoting case (note: only MAGMA supports no-pivoting interface) @@ -253,41 +265,51 @@ int test_gesv(const char* mode) { #if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || \ (!defined(KOKKOSKERNELS_ETI_ONLY) && \ !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) - typedef Kokkos::View view_type_a_ll; - typedef Kokkos::View view_type_b_ll; - Test::impl_test_gesv( + using view_type_a_ll = Kokkos::View; + using view_type_b_ll = Kokkos::View; + +#if (defined(TEST_CUDA_LAPACK_CPP) && \ + defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER)) || \ + (defined(TEST_HIP_LAPACK_CPP) && \ + defined(KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER)) || \ + (defined(KOKKOSKERNELS_ENABLE_TPL_LAPACK) && \ + (defined(TEST_OPENMP_LAPACK_CPP) || defined(TEST_SERIAL_LAPACK_CPP) || \ + defined(TEST_THREADS_LAPACK_CPP))) + Test::impl_test_gesv( &mode[0], "N", 2); // no padding - Test::impl_test_gesv( + Test::impl_test_gesv( &mode[0], "N", 13); // no padding - Test::impl_test_gesv( + Test::impl_test_gesv( &mode[0], "N", 179); // no padding - Test::impl_test_gesv( + Test::impl_test_gesv( &mode[0], "N", 64); // no padding - Test::impl_test_gesv( + Test::impl_test_gesv( &mode[0], "N", 1024); // no padding - Test::impl_test_gesv(&mode[0], "Y", - 13); // padding - Test::impl_test_gesv(&mode[0], "Y", - 179); // padding + +#elif defined(KOKKOSKERNELS_ENABLE_TPL_MAGMA) && defined(KOKKOS_ENABLE_CUDA) + if constexpr (std::is_same_v) { + Test::impl_test_gesv( + &mode[0], "N", 2); // no padding + Test::impl_test_gesv( + &mode[0], "N", 13); // no padding + Test::impl_test_gesv( + &mode[0], "N", 179); // no padding + Test::impl_test_gesv( + &mode[0], "N", 64); // no padding + Test::impl_test_gesv( + &mode[0], "N", 1024); // no padding + + Test::impl_test_gesv( + &mode[0], "Y", + 13); // padding + Test::impl_test_gesv( + &mode[0], "Y", + 179); // padding + } 
+#endif #endif - /* - #if defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) || - (!defined(KOKKOSKERNELS_ETI_ONLY) && - !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) typedef Kokkos::View view_type_a_lr; typedef Kokkos::View view_type_b_lr; - Test::impl_test_gesv(&mode[0], "N", - 2); //no padding Test::impl_test_gesv(&mode[0], "N", 13); //no padding Test::impl_test_gesv(&mode[0], "N", 179); //no padding - Test::impl_test_gesv(&mode[0], "N", - 64); //no padding Test::impl_test_gesv(&mode[0], "N", 1024);//no padding Test::impl_test_gesv(&mode[0], "Y", 13); //padding - Test::impl_test_gesv(&mode[0], "Y", - 179); //padding #endif - */ // Supress unused parameters on CUDA10 (void)mode; return 1; @@ -298,42 +320,50 @@ int test_gesv_mrhs(const char* mode) { #if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || \ (!defined(KOKKOSKERNELS_ETI_ONLY) && \ !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) - typedef Kokkos::View view_type_a_ll; - typedef Kokkos::View view_type_b_ll; - Test::impl_test_gesv_mrhs( + using view_type_a_ll = Kokkos::View; + using view_type_b_ll = Kokkos::View; + +#if (defined(TEST_CUDA_LAPACK_CPP) && \ + defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER)) || \ + (defined(TEST_HIP_LAPACK_CPP) && \ + defined(KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER)) || \ + (defined(KOKKOSKERNELS_ENABLE_TPL_LAPACK) && \ + (defined(TEST_OPENMP_LAPACK_CPP) || defined(TEST_SERIAL_LAPACK_CPP) || \ + defined(TEST_THREADS_LAPACK_CPP))) + Test::impl_test_gesv_mrhs( &mode[0], "N", 2, 5); // no padding - Test::impl_test_gesv_mrhs( + Test::impl_test_gesv_mrhs( &mode[0], "N", 13, 5); // no padding - Test::impl_test_gesv_mrhs( + Test::impl_test_gesv_mrhs( &mode[0], "N", 179, 5); // no padding - Test::impl_test_gesv_mrhs( + Test::impl_test_gesv_mrhs( &mode[0], "N", 64, 5); // no padding - Test::impl_test_gesv_mrhs( + Test::impl_test_gesv_mrhs( &mode[0], "N", 1024, 5); // no padding - Test::impl_test_gesv_mrhs( - &mode[0], "Y", 13, 5); // padding - Test::impl_test_gesv_mrhs( - &mode[0], "Y", 179, 5); // padding + +// 
When appropriate run MAGMA specific tests +#elif defined(KOKKOSKERNELS_ENABLE_TPL_MAGMA) && defined(KOKKOS_ENABLE_CUDA) + if constexpr (std::is_same_v) { + Test::impl_test_gesv_mrhs( + &mode[0], "N", 2, 5); // no padding + Test::impl_test_gesv_mrhs( + &mode[0], "N", 13, 5); // no padding + Test::impl_test_gesv_mrhs( + &mode[0], "N", 179, 5); // no padding + Test::impl_test_gesv_mrhs( + &mode[0], "N", 64, 5); // no padding + Test::impl_test_gesv_mrhs( + &mode[0], "N", 1024, 5); // no padding + + Test::impl_test_gesv_mrhs( + &mode[0], "Y", 13, 5); // padding + Test::impl_test_gesv_mrhs( + &mode[0], "Y", 179, 5); // padding + } +#endif #endif - /* - #if defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) || - (!defined(KOKKOSKERNELS_ETI_ONLY) && - !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) typedef Kokkos::View view_type_a_lr; typedef Kokkos::View view_type_b_lr; - Test::impl_test_gesv_mrhs(&mode[0], - "N", 2, 5);//no padding Test::impl_test_gesv_mrhs(&mode[0], "N", 13, 5);//no padding - Test::impl_test_gesv_mrhs(&mode[0], - "N", 179, 5);//no padding Test::impl_test_gesv_mrhs(&mode[0], "N", 64, 5);//no padding - Test::impl_test_gesv_mrhs(&mode[0], - "N", 1024,5);//no padding Test::impl_test_gesv_mrhs(&mode[0], "Y", 13, 5);//padding - Test::impl_test_gesv_mrhs(&mode[0], - "Y", 179, 5);//padding #endif - */ // Supress unused parameters on CUDA10 (void)mode; return 1; @@ -411,4 +441,4 @@ TEST_F(TestCategory, gesv_mrhs_complex_float) { } #endif -#endif // CUDA+MAGMA or LAPACK+HOST +#endif // CUDA+(MAGMA or CUSOLVER) or HIP+ROCSOLVER or LAPACK+HOST diff --git a/lapack/unit_test/Test_Lapack_svd.hpp b/lapack/unit_test/Test_Lapack_svd.hpp new file mode 100644 index 0000000000..da9f9ba480 --- /dev/null +++ b/lapack/unit_test/Test_Lapack_svd.hpp @@ -0,0 +1,658 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). 
+// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#include + +#include +#include +#include +#include + +#include + +namespace Test { + +template +void check_triple_product( + const AMatrix& A, const SVector& S, const UMatrix& U, const VMatrix& Vt, + typename Kokkos::ArithTraits< + typename AMatrix::non_const_value_type>::mag_type tol) { + // After a successful SVD decomposition we have A=U*S*V + // So using gemm we should be able to compare the above + // triple product to the original matrix A. + using execution_space = typename AMatrix::execution_space; + + AMatrix temp("intermediate U*S product", A.extent(0), A.extent(1)); + AMatrix M("U*S*V product", A.extent(0), A.extent(1)); + + // First compute the left side of the product: temp = U*S + Kokkos::parallel_for( + Kokkos::RangePolicy(0, U.extent_int(0)), + KOKKOS_LAMBDA(const int& rowIdx) { + for (int colIdx = 0; colIdx < U.extent_int(1); ++colIdx) { + if (colIdx < S.extent_int(0)) { + temp(rowIdx, colIdx) = U(rowIdx, colIdx) * S(colIdx); + } + } + }); + + // Second compute the right side of the product: M = temp*V = U*S*V + KokkosBlas::gemm("N", "N", 1, temp, Vt, 0, M); + + typename AMatrix::HostMirror A_h = Kokkos::create_mirror_view(A); + typename AMatrix::HostMirror M_h = Kokkos::create_mirror_view(M); + Kokkos::deep_copy(A_h, A); + Kokkos::deep_copy(M_h, M); + for (int rowIdx = 0; rowIdx < A.extent_int(0); ++rowIdx) { + for (int colIdx = 0; colIdx < A.extent_int(1); ++colIdx) { + if (tol < Kokkos::abs(A_h(rowIdx, colIdx))) { + EXPECT_NEAR_KK_REL(A_h(rowIdx, colIdx), M_h(rowIdx, colIdx), tol); + } else { + EXPECT_NEAR_KK(A_h(rowIdx, colIdx), M_h(rowIdx, colIdx), tol); + } + } + } +} + +template +void 
check_unitary_orthogonal_matrix( + const Matrix& M, typename Kokkos::ArithTraits< + typename Matrix::non_const_value_type>::mag_type tol) { + // After a successful SVD decomposition the matrices + // U and V are unitary matrices. Thus we can check + // the property UUt=UtU=I and VVt=VtV=I using gemm. + using scalar_type = typename Matrix::non_const_value_type; + + Matrix I0("M*Mt", M.extent(0), M.extent(0)); + KokkosBlas::gemm("N", "C", 1, M, M, 0, I0); + typename Matrix::HostMirror I0_h = Kokkos::create_mirror_view(I0); + Kokkos::deep_copy(I0_h, I0); + for (int rowIdx = 0; rowIdx < M.extent_int(0); ++rowIdx) { + for (int colIdx = 0; colIdx < M.extent_int(0); ++colIdx) { + if (rowIdx == colIdx) { + EXPECT_NEAR_KK_REL(I0_h(rowIdx, colIdx), + Kokkos::ArithTraits::one(), tol); + } else { + EXPECT_NEAR_KK(I0_h(rowIdx, colIdx), + Kokkos::ArithTraits::zero(), tol); + } + } + } + + Matrix I1("Mt*M", M.extent(1), M.extent(1)); + KokkosBlas::gemm("C", "N", 1, M, M, 0, I1); + typename Matrix::HostMirror I1_h = Kokkos::create_mirror_view(I1); + Kokkos::deep_copy(I1_h, I1); + for (int rowIdx = 0; rowIdx < M.extent_int(1); ++rowIdx) { + for (int colIdx = 0; colIdx < M.extent_int(1); ++colIdx) { + if (rowIdx == colIdx) { + EXPECT_NEAR_KK_REL(I1_h(rowIdx, colIdx), + Kokkos::ArithTraits::one(), tol); + } else { + EXPECT_NEAR_KK(I1_h(rowIdx, colIdx), + Kokkos::ArithTraits::zero(), tol); + } + } + } +} + +template +int impl_analytic_2x2_svd() { + using scalar_type = typename AMatrix::value_type; + using mag_type = typename Kokkos::ArithTraits::mag_type; + using vector_type = + Kokkos::View; + using KAT_S = Kokkos::ArithTraits; + + const mag_type eps = KAT_S::eps(); + + AMatrix A("A", 2, 2), U("U", 2, 2), Vt("Vt", 2, 2), Aref("A ref", 2, 2); + vector_type S("S", 2); + + typename AMatrix::HostMirror A_h = Kokkos::create_mirror_view(A); + + // A = [3 0] + // [4 5] + // USV = 1/sqrt(10) [1 -3] * sqrt(5) [3 0] * 1/sqrt(2) [ 1 1] + // [3 1] [0 1] [-1 1] + A_h(0, 0) = 3; + A_h(1, 0) = 4; + 
A_h(1, 1) = 5; + + Kokkos::deep_copy(A, A_h); + Kokkos::deep_copy(Aref, A_h); + + KokkosLapack::svd("A", "A", A, S, U, Vt); + // Don't really need to fence here as we deep_copy right after... + + typename vector_type::HostMirror S_h = Kokkos::create_mirror_view(S); + Kokkos::deep_copy(S_h, S); + typename AMatrix::HostMirror U_h = Kokkos::create_mirror_view(U); + Kokkos::deep_copy(U_h, U); + typename AMatrix::HostMirror Vt_h = Kokkos::create_mirror_view(Vt); + Kokkos::deep_copy(Vt_h, Vt); + + // The singular values for this problem + // are known: sqrt(45) and sqrt(5) + EXPECT_NEAR_KK_REL(S_h(0), static_cast(Kokkos::sqrt(45)), + 100 * eps); + EXPECT_NEAR_KK_REL(S_h(1), static_cast(Kokkos::sqrt(5)), 100 * eps); + + // The singular vectors should be identical + // or of opposite sign, so we check the first + // component of the vectors to determine + // the proper signed comparison. + std::vector Uref = { + static_cast(1 / Kokkos::sqrt(10)), + static_cast(3 / Kokkos::sqrt(10)), + static_cast(-3 / Kokkos::sqrt(10)), + static_cast(1 / Kokkos::sqrt(10))}; + std::vector Vtref = { + static_cast(1 / Kokkos::sqrt(2)), + static_cast(-1 / Kokkos::sqrt(2)), + static_cast(1 / Kokkos::sqrt(2)), + static_cast(1 / Kokkos::sqrt(2))}; + + // Both rotations and reflections are valid + // vector bases so we need to check both signs + // to confirm proper SVD was achieved. 
+ Kokkos::View U_real("U real", 2, 2), + Vt_real("Vt real", 2, 2); + if constexpr (KAT_S::is_complex) { + U_real(0, 0) = U_h(0, 0).real(); + U_real(0, 1) = U_h(0, 1).real(); + U_real(1, 0) = U_h(1, 0).real(); + U_real(1, 1) = U_h(1, 1).real(); + + Vt_real(0, 0) = Vt_h(0, 0).real(); + Vt_real(0, 1) = Vt_h(0, 1).real(); + Vt_real(1, 0) = Vt_h(1, 0).real(); + Vt_real(1, 1) = Vt_h(1, 1).real(); + } else { + U_real(0, 0) = U_h(0, 0); + U_real(0, 1) = U_h(0, 1); + U_real(1, 0) = U_h(1, 0); + U_real(1, 1) = U_h(1, 1); + + Vt_real(0, 0) = Vt_h(0, 0); + Vt_real(0, 1) = Vt_h(0, 1); + Vt_real(1, 0) = Vt_h(1, 0); + Vt_real(1, 1) = Vt_h(1, 1); + } + + const mag_type tol = 100 * KAT_S::eps(); + const mag_type one_sqrt10 = static_cast(1 / Kokkos::sqrt(10)); + const mag_type one_sqrt2 = static_cast(1 / Kokkos::sqrt(2)); + + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(0, 0)), one_sqrt10, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(0, 1)), 3 * one_sqrt10, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(1, 0)), 3 * one_sqrt10, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(1, 1)), one_sqrt10, tol); + + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(0, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(0, 1)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(1, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(1, 1)), one_sqrt2, tol); + + check_unitary_orthogonal_matrix(U, tol); + check_unitary_orthogonal_matrix(Vt, tol); + + check_triple_product(Aref, S, U, Vt, tol); + + return 0; +} + +template +int impl_analytic_2x3_svd() { + using scalar_type = typename AMatrix::value_type; + using mag_type = typename Kokkos::ArithTraits::mag_type; + using vector_type = + Kokkos::View; + using KAT_S = Kokkos::ArithTraits; + + const mag_type tol = 100 * KAT_S::eps(); + + AMatrix A("A", 2, 3), U("U", 2, 2), Vt("Vt", 3, 3), Aref("A ref", 2, 3); + vector_type S("S", 2); + + typename AMatrix::HostMirror A_h = Kokkos::create_mirror_view(A); + + // A = [3 2 2] + // [2 3 -2] + // USVt = 
1/sqrt(2) [1 1] * [5 0 0] * 1/(3*sqrt(2)) [ 3 3 0] + // [1 -1] [0 3 0] [ 1 -1 4] + // [2*sqrt(2) -2*sqrt(2) + // -sqrt(2)] + A_h(0, 0) = 3; + A_h(0, 1) = 2; + A_h(0, 2) = 2; + A_h(1, 0) = 2; + A_h(1, 1) = 3; + A_h(1, 2) = -2; + + Kokkos::deep_copy(A, A_h); + Kokkos::deep_copy(Aref, A_h); + + try { + KokkosLapack::svd("A", "A", A, S, U, Vt); + } catch (const std::runtime_error& e) { + std::string test_string = e.what(); + std::string cusolver_m_less_than_n = + "CUSOLVER does not support SVD for matrices with more columns " + "than rows, you can transpose you matrix first then compute " + "SVD of that transpose: At=VSUt, and swap the output U and Vt" + " and transpose them to recover the desired SVD."; + + if (test_string == cusolver_m_less_than_n) { + return 0; + } + } + // Don't really need to fence here as we deep_copy right after... + + typename vector_type::HostMirror S_h = Kokkos::create_mirror_view(S); + Kokkos::deep_copy(S_h, S); + typename AMatrix::HostMirror U_h = Kokkos::create_mirror_view(U); + Kokkos::deep_copy(U_h, U); + typename AMatrix::HostMirror Vt_h = Kokkos::create_mirror_view(Vt); + Kokkos::deep_copy(Vt_h, Vt); + + // The singular values for this problem + // are known: 5 and 3 + EXPECT_NEAR_KK_REL(S_h(0), static_cast(5), tol); + EXPECT_NEAR_KK_REL(S_h(1), static_cast(3), tol); + + // Both rotations and reflections are valid + // vector bases so we need to check both signs + // to confirm proper SVD was achieved. 
+ Kokkos::View U_real("U real", 2, 2), + Vt_real("Vt real", 3, 3); + if constexpr (KAT_S::is_complex) { + U_real(0, 0) = U_h(0, 0).real(); + U_real(0, 1) = U_h(0, 1).real(); + U_real(1, 0) = U_h(1, 0).real(); + U_real(1, 1) = U_h(1, 1).real(); + + Vt_real(0, 0) = Vt_h(0, 0).real(); + Vt_real(0, 1) = Vt_h(0, 1).real(); + Vt_real(0, 2) = Vt_h(0, 2).real(); + Vt_real(1, 0) = Vt_h(1, 0).real(); + Vt_real(1, 1) = Vt_h(1, 1).real(); + Vt_real(1, 2) = Vt_h(1, 2).real(); + Vt_real(2, 0) = Vt_h(2, 0).real(); + Vt_real(2, 1) = Vt_h(2, 1).real(); + Vt_real(2, 2) = Vt_h(2, 2).real(); + } else { + U_real(0, 0) = U_h(0, 0); + U_real(0, 1) = U_h(0, 1); + U_real(1, 0) = U_h(1, 0); + U_real(1, 1) = U_h(1, 1); + + Vt_real(0, 0) = Vt_h(0, 0); + Vt_real(0, 1) = Vt_h(0, 1); + Vt_real(0, 2) = Vt_h(0, 2); + Vt_real(1, 0) = Vt_h(1, 0); + Vt_real(1, 1) = Vt_h(1, 1); + Vt_real(1, 2) = Vt_h(1, 2); + Vt_real(2, 0) = Vt_h(2, 0); + Vt_real(2, 1) = Vt_h(2, 1); + Vt_real(2, 2) = Vt_h(2, 2); + } + + const mag_type one_sqrt2 = static_cast(1 / Kokkos::sqrt(2)); + const mag_type one_sqrt18 = static_cast(1 / Kokkos::sqrt(18)); + const mag_type one_third = static_cast(1. 
/ 3.); + + // Check values of U + // Don't worry about the sign; + // it will be checked with the + // triple product + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(0, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(0, 1)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(1, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(1, 1)), one_sqrt2, tol); + + // Check values of Vt + // Don't worry about the sign; + // it will be checked with the + // triple product + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(0, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(0, 1)), one_sqrt2, tol); + EXPECT_NEAR_KK(Kokkos::abs(Vt_real(0, 2)), 0, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(1, 0)), one_sqrt18, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(1, 1)), one_sqrt18, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(1, 2)), 4 * one_sqrt18, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(2, 0)), 2 * one_third, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(2, 1)), 2 * one_third, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(2, 2)), one_third, tol); + + check_unitary_orthogonal_matrix(U, tol); + check_unitary_orthogonal_matrix(Vt, tol); + + check_triple_product(Aref, S, U, Vt, tol); + + return 0; +} + +template +int impl_analytic_3x2_svd() { + using scalar_type = typename AMatrix::value_type; + using mag_type = typename Kokkos::ArithTraits::mag_type; + using vector_type = + Kokkos::View; + using KAT_S = Kokkos::ArithTraits; + + const mag_type tol = 100 * KAT_S::eps(); + + AMatrix A("A", 3, 2), U("U", 3, 3), Vt("Vt", 2, 2), Aref("A ref", 3, 2); + vector_type S("S", 2); + + typename AMatrix::HostMirror A_h = Kokkos::create_mirror_view(A); + + // Note this is simply the transpose of the 2x3 matrix in the test above + // A = [3 2] + // [2 3] + // [2 -2] + // USVt = 1/(3*sqrt(2)) [3 1 2*sqrt(2)] * [5 0] * 1/sqrt(2) [1 1] + // [3 -1 -2*sqrt(2)] [0 3] [1 -1] + // [0 4 sqrt(2)] [0 0] + A_h(0, 0) = 3; + A_h(0, 1) = 2; + A_h(1, 0) = 2; + A_h(1, 1) = 3; + 
A_h(2, 0) = 2; + A_h(2, 1) = -2; + + Kokkos::deep_copy(A, A_h); + Kokkos::deep_copy(Aref, A_h); + + KokkosLapack::svd("A", "A", A, S, U, Vt); + // Don't really need to fence here as we deep_copy right after... + + typename vector_type::HostMirror S_h = Kokkos::create_mirror_view(S); + Kokkos::deep_copy(S_h, S); + typename AMatrix::HostMirror U_h = Kokkos::create_mirror_view(U); + Kokkos::deep_copy(U_h, U); + typename AMatrix::HostMirror Vt_h = Kokkos::create_mirror_view(Vt); + Kokkos::deep_copy(Vt_h, Vt); + + // The singular values for this problem + // are known: 5 and 3 + EXPECT_NEAR_KK_REL(S_h(0), static_cast(5), tol); + EXPECT_NEAR_KK_REL(S_h(1), static_cast(3), tol); + + // Both rotations and reflections are valid + // vector bases so we need to check both signs + // to confirm proper SVD was achieved. + Kokkos::View U_real("U real", 3, 3), + Vt_real("Vt real", 2, 2); + if constexpr (KAT_S::is_complex) { + U_real(0, 0) = U_h(0, 0).real(); + U_real(0, 1) = U_h(0, 1).real(); + U_real(0, 2) = U_h(0, 2).real(); + U_real(1, 0) = U_h(1, 0).real(); + U_real(1, 1) = U_h(1, 1).real(); + U_real(1, 2) = U_h(1, 2).real(); + U_real(2, 0) = U_h(2, 0).real(); + U_real(2, 1) = U_h(2, 1).real(); + U_real(2, 2) = U_h(2, 2).real(); + + Vt_real(0, 0) = Vt_h(0, 0).real(); + Vt_real(0, 1) = Vt_h(0, 1).real(); + Vt_real(1, 0) = Vt_h(1, 0).real(); + Vt_real(1, 1) = Vt_h(1, 1).real(); + } else { + U_real(0, 0) = U_h(0, 0); + U_real(0, 1) = U_h(0, 1); + U_real(0, 2) = U_h(0, 2); + U_real(1, 0) = U_h(1, 0); + U_real(1, 1) = U_h(1, 1); + U_real(1, 2) = U_h(1, 2); + U_real(2, 0) = U_h(2, 0); + U_real(2, 1) = U_h(2, 1); + U_real(2, 2) = U_h(2, 2); + + Vt_real(0, 0) = Vt_h(0, 0); + Vt_real(0, 1) = Vt_h(0, 1); + Vt_real(1, 0) = Vt_h(1, 0); + Vt_real(1, 1) = Vt_h(1, 1); + } + + const mag_type one_sqrt2 = static_cast(1 / Kokkos::sqrt(2)); + const mag_type one_sqrt18 = static_cast(1 / Kokkos::sqrt(18)); + const mag_type one_third = static_cast(1. 
/ 3.); + + // Check values of U + // Don't worry about the sign; + // it will be checked with the + // triple product + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(0, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(0, 1)), one_sqrt18, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(0, 2)), 2 * one_third, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(1, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(1, 1)), one_sqrt18, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(1, 2)), 2 * one_third, tol); + EXPECT_NEAR_KK(Kokkos::abs(U_real(2, 0)), 0, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(2, 1)), 4 * one_sqrt18, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(U_real(2, 2)), one_third, tol); + + // Check values of Vt + // Don't worry about the sign; + // it will be checked with the + // triple product + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(0, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(0, 1)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(1, 0)), one_sqrt2, tol); + EXPECT_NEAR_KK_REL(Kokkos::abs(Vt_real(1, 1)), one_sqrt2, tol); + + check_unitary_orthogonal_matrix(U, tol); + check_unitary_orthogonal_matrix(Vt, tol); + + check_triple_product(Aref, S, U, Vt, tol); + + return 0; +} + +template +int impl_test_svd(const int m, const int n) { + using execution_space = typename Device::execution_space; + using scalar_type = typename AMatrix::value_type; + using KAT_S = Kokkos::ArithTraits; + using mag_type = typename KAT_S::mag_type; + using vector_type = + Kokkos::View; + + const mag_type max_val = 10; + const mag_type tol = 2000 * max_val * KAT_S::eps(); + + AMatrix A("A", m, n), U("U", m, m), Vt("Vt", n, n), Aref("A ref", m, n); + vector_type S("S", Kokkos::min(m, n)); + + const uint64_t seed = + std::chrono::high_resolution_clock::now().time_since_epoch().count(); + Kokkos::Random_XorShift64_Pool rand_pool(seed); + + // Initialize A with random numbers + scalar_type randStart = 0, randEnd = 0; + Test::getRandomBounds(max_val, randStart, 
randEnd); + Kokkos::fill_random(A, rand_pool, randStart, randEnd); + Kokkos::deep_copy(Aref, A); + + // Working around CUSOLVER constraint for m >= n +#if defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER) + if constexpr (std::is_same_v) { + if (m >= n) { + KokkosLapack::svd("A", "A", A, S, U, Vt); + } else { + return 0; + } + } else { + KokkosLapack::svd("A", "A", A, S, U, Vt); + } +#else + KokkosLapack::svd("A", "A", A, S, U, Vt); +#endif + + check_unitary_orthogonal_matrix(U, tol); + check_unitary_orthogonal_matrix(Vt, tol); + + // For larger sizes with the triple product + // we accumulate a bit more error apparently? + check_triple_product(Aref, S, U, Vt, 100 * Kokkos::max(m, n) * tol); + + return 0; +} + +} // namespace Test + +template +int test_svd() { + int ret; + +#if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) + using view_type_a_layout_left = + Kokkos::View; + + ret = Test::impl_analytic_2x2_svd(); + EXPECT_EQ(ret, 0); + + ret = Test::impl_analytic_2x3_svd(); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(0, 0); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(1, 1); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(15, 15); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(100, 100); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(100, 70); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(70, 100); + EXPECT_EQ(ret, 0); +#endif + +#if defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) + using view_type_a_layout_right = + Kokkos::View; + + ret = Test::impl_analytic_2x2_svd(); + EXPECT_EQ(ret, 0); + + ret = Test::impl_analytic_2x3_svd(); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(0, 0); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(1, 1); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(15, 15); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(100, 100); + 
EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(100, 70); + EXPECT_EQ(ret, 0); + + ret = Test::impl_test_svd(70, 100); + EXPECT_EQ(ret, 0); +#endif + + return 1; +} + +template +int test_svd_wrapper() { +#if defined(KOKKOSKERNELS_ENABLE_TPL_LAPACK) || \ + defined(KOKKOSKERNELS_ENABLE_TPL_MKL) + if constexpr (std::is_same_v) { + // Using a device side space with LAPACK/MKL + return test_svd(); + } +#endif + +#if defined(KOKKOSKERNELS_ENABLE_TPL_CUSOLVER) + if constexpr (std::is_same_v) { + // Using a Cuda device with CUSOLVER + return test_svd(); + } +#endif + +#if defined(KOKKOSKERNELS_ENABLE_TPL_ROCSOLVER) + if constexpr (std::is_same_v) { + // Using a HIP device with ROCSOLVER + return test_svd(); + } +#endif + + std::cout << "No TPL support enabled, svd is not tested" << std::endl; + return 0; +} + +#if defined(KOKKOSKERNELS_INST_FLOAT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, svd_float) { + Kokkos::Profiling::pushRegion("KokkosLapack::Test::svd_float"); + test_svd_wrapper(); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_DOUBLE) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, svd_double) { + Kokkos::Profiling::pushRegion("KokkosLapack::Test::svd_double"); + test_svd_wrapper(); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_COMPLEX_FLOAT) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, svd_complex_float) { + Kokkos::Profiling::pushRegion("KokkosLapack::Test::svd_complex_float"); + test_svd_wrapper, TestDevice>(); + Kokkos::Profiling::popRegion(); +} +#endif + +#if defined(KOKKOSKERNELS_INST_COMPLEX_DOUBLE) || \ + (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) +TEST_F(TestCategory, svd_complex_double) { + 
Kokkos::Profiling::pushRegion("KokkosLapack::Test::svd_complex_double"); + test_svd_wrapper, TestDevice>(); + Kokkos::Profiling::popRegion(); +} +#endif diff --git a/master_history.txt b/master_history.txt index 26f95694e9..2207bca133 100644 --- a/master_history.txt +++ b/master_history.txt @@ -8,7 +8,7 @@ tag: 2.8.00 date: 02/05/2019 master: a6e05e06 develop: 6a790321 tag: 2.9.00 date: 06/24/2019 master: 4ee5f3c6 develop: 094da30c tag: 3.0.00 date: 01/31/2020 master: d86db111 release-candidate-3.0: cf24ab90 tag: 3.1.00 date: 04/14/2020 master: f199f45d develop: 8d063eae -tag: 3.1.01 date: 05/04/2020 master: 43773523 release: 6fce7502 +tag: 3.1.01 date: 05/04/2020 master: 43773523 release: 6fce7502 tag: 3.2.00 date: 08/19/2020 master: 07a60bcc release: ea3f2b77 tag: 3.3.00 date: 12/16/2020 master: 42defc56 release: e5279e55 tag: 3.3.01 date: 01/18/2021 master: f64b1c57 release: 4e1cc00b @@ -24,3 +24,4 @@ tag: 4.0.01 date: 04/26/2023 master: b9c1bab7 release: 8809e41c tag: 4.1.00 date: 06/20/2023 master: 1331baf1 release: 14ad220a tag: 4.2.00 date: 11/09/2023 master: 25a31f88 release: 912d3778 tag: 4.2.01 date: 01/30/2024 master: f429f6ec release: bcf9854b +tag: 4.3.00 date: 04/03/2024 master: afd65f03 release: ebbf4b78 diff --git a/ode/impl/KokkosODE_BDF_impl.hpp b/ode/impl/KokkosODE_BDF_impl.hpp new file mode 100644 index 0000000000..cf89731f1b --- /dev/null +++ b/ode/impl/KokkosODE_BDF_impl.hpp @@ -0,0 +1,532 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSBLAS_BDF_IMPL_HPP +#define KOKKOSBLAS_BDF_IMPL_HPP + +#include "Kokkos_Core.hpp" + +#include "KokkosODE_Newton.hpp" +#include "KokkosBlas2_serial_gemv.hpp" +#include "KokkosBatched_Gemm_Decl.hpp" + +namespace KokkosODE { +namespace Impl { + +template +struct BDF_table {}; + +template <> +struct BDF_table<1> { + static constexpr int order = 1; + Kokkos::Array coefficients{{-1.0, 1.0}}; +}; + +template <> +struct BDF_table<2> { + static constexpr int order = 2; + Kokkos::Array coefficients{{-4.0 / 3.0, 1.0 / 3.0, 2.0 / 3.0}}; +}; + +template <> +struct BDF_table<3> { + static constexpr int order = 3; + Kokkos::Array coefficients{ + {-18.0 / 11.0, 9.0 / 11.0, -2.0 / 11.0, 6.0 / 11.0}}; +}; + +template <> +struct BDF_table<4> { + static constexpr int order = 4; + Kokkos::Array coefficients{ + {-48.0 / 25.0, 36.0 / 25.0, -16.0 / 25.0, 3.0 / 25.0, 12.0 / 25.0}}; +}; + +template <> +struct BDF_table<5> { + static constexpr int order = 5; + Kokkos::Array coefficients{{-300.0 / 137.0, 300.0 / 137.0, + -200.0 / 137.0, 75.0 / 137.0, + -12.0 / 137.0, 60.0 / 137.0}}; +}; + +template <> +struct BDF_table<6> { + static constexpr int order = 6; + Kokkos::Array coefficients{ + {-360.0 / 147.0, 450.0 / 147.0, -400.0 / 147.0, 225.0 / 147.0, + -72.0 / 147.0, 10.0 / 147.0, 60.0 / 147.0}}; +}; + +template +struct BDF_system_wrapper { + const system_type mySys; + const int neqs; + const table_type table; + const int order = table.order; + + double t, dt; + mv_type yn; + + KOKKOS_FUNCTION + BDF_system_wrapper(const system_type& mySys_, const table_type& table_, + const double t_, const double dt_, const mv_type& yn_) + : mySys(mySys_), + neqs(mySys_.neqs), + table(table_), + t(t_), + dt(dt_), + yn(yn_) {} + + template + KOKKOS_FUNCTION void residual(const vec_type& y, const vec_type& f) const { + // f = f(t+dt, y) + mySys.evaluate_function(t, dt, y, f); + + for (int eqIdx = 0; eqIdx < neqs; 
++eqIdx) { + f(eqIdx) = y(eqIdx) - table.coefficients[order] * dt * f(eqIdx); + for (int orderIdx = 0; orderIdx < order; ++orderIdx) { + f(eqIdx) += + table.coefficients[order - 1 - orderIdx] * yn(eqIdx, orderIdx); + } + } + } + + template + KOKKOS_FUNCTION void jacobian(const vec_type& y, const mat_type& jac) const { + mySys.evaluate_jacobian(t, dt, y, jac); + + for (int rowIdx = 0; rowIdx < neqs; ++rowIdx) { + for (int colIdx = 0; colIdx < neqs; ++colIdx) { + jac(rowIdx, colIdx) = + -table.coefficients[order] * dt * jac(rowIdx, colIdx); + } + jac(rowIdx, rowIdx) += 1.0; + } + } +}; + +template +struct BDF_system_wrapper2 { + const system_type mySys; + const int neqs; + const subview_type psi; + const d_vec_type d; + + bool compute_jac = true; + double t, dt, c = 0; + + KOKKOS_FUNCTION + BDF_system_wrapper2(const system_type& mySys_, const subview_type& psi_, + const d_vec_type& d_, const double t_, const double dt_) + : mySys(mySys_), neqs(mySys_.neqs), psi(psi_), d(d_), t(t_), dt(dt_) {} + + template + KOKKOS_FUNCTION void residual(const YVectorType& y, + const FVectorType& f) const { + // f = f(t+dt, y) + mySys.evaluate_function(t, dt, y, f); + + // std::cout << "f = psi + d - c * f = " << psi(0) << " + " << d(0) << " - " + // << c << " * " << f(0) << std::endl; + + // rhs = higher order terms + y_{n+1}^i - y_n - dt*f + for (int eqIdx = 0; eqIdx < neqs; ++eqIdx) { + f(eqIdx) = psi(eqIdx) + d(eqIdx) - c * f(eqIdx); + } + } + + template + KOKKOS_FUNCTION void jacobian(const vec_type& y, const mat_type& jac) const { + if (compute_jac) { + mySys.evaluate_jacobian(t, dt, y, jac); + + // J = I - dt*(df/dy) + for (int rowIdx = 0; rowIdx < neqs; ++rowIdx) { + for (int colIdx = 0; colIdx < neqs; ++colIdx) { + jac(rowIdx, colIdx) = -dt * jac(rowIdx, colIdx); + } + jac(rowIdx, rowIdx) += 1.0; + } + } + } +}; + +template +KOKKOS_FUNCTION void BDFStep(ode_type& ode, const table_type& table, + scalar_type t, scalar_type dt, + const vec_type& y_old, const vec_type& y_new, + 
const vec_type& rhs, const vec_type& update, + const vec_type& scale, const mv_type& y_vecs, + const mat_type& temp, const mat_type& jac) { + using newton_params = KokkosODE::Experimental::Newton_params; + + BDF_system_wrapper sys(ode, table, t, dt, y_vecs); + const newton_params param(50, 1e-14, 1e-12); + + // first set y_new = y_old + for (int eqIdx = 0; eqIdx < sys.neqs; ++eqIdx) { + y_new(eqIdx) = y_old(eqIdx); + } + + // solve the nonlinear problem + { + KokkosODE::Experimental::Newton::Solve(sys, param, jac, temp, y_new, rhs, + update, scale); + } + +} // BDFStep + +template +KOKKOS_FUNCTION void compute_coeffs(const int order, const scalar_type factor, + const mat_type& coeffs) { + coeffs(0, 0) = 1.0; + for (int colIdx = 0; colIdx < order; ++colIdx) { + coeffs(0, colIdx + 1) = 1.0; + for (int rowIdx = 0; rowIdx < order; ++rowIdx) { + coeffs(rowIdx + 1, colIdx + 1) = + ((rowIdx - factor * (colIdx + 1.0)) / (rowIdx + 1.0)) * + coeffs(rowIdx, colIdx + 1); + } + } +} + +template +KOKKOS_FUNCTION void update_D(const int order, const scalar_type factor, + const mat_type& coeffs, const mat_type& tempD, + const mat_type& D) { + auto subD = + Kokkos::subview(D, Kokkos::ALL(), Kokkos::pair(0, order + 1)); + auto subTempD = Kokkos::subview(tempD, Kokkos::ALL(), + Kokkos::pair(0, order + 1)); + + compute_coeffs(order, factor, coeffs); + auto R = Kokkos::subview(coeffs, Kokkos::pair(0, order + 1), + Kokkos::pair(0, order + 1)); + KokkosBatched::SerialGemm< + KokkosBatched::Trans::NoTranspose, KokkosBatched::Trans::NoTranspose, + KokkosBatched::Algo::Gemm::Blocked>::invoke(1.0, subD, R, 0.0, subTempD); + + compute_coeffs(order, 1.0, coeffs); + auto U = Kokkos::subview(coeffs, Kokkos::pair(0, order + 1), + Kokkos::pair(0, order + 1)); + KokkosBatched::SerialGemm< + KokkosBatched::Trans::NoTranspose, KokkosBatched::Trans::NoTranspose, + KokkosBatched::Algo::Gemm::Blocked>::invoke(1.0, subTempD, U, 0.0, subD); +} + +template +KOKKOS_FUNCTION void initial_step_size( + const 
ode_type ode, const int order, const scalar_type t0, + const scalar_type atol, const scalar_type rtol, const vec_type& y0, + const res_type& f0, const mat_type& temp, scalar_type& dt_ini) { + using KAT = Kokkos::ArithTraits; + + // Extract subviews to store intermediate data + auto scale = Kokkos::subview(temp, Kokkos::ALL(), 1); + auto y1 = Kokkos::subview(temp, Kokkos::ALL(), 2); + auto f1 = Kokkos::subview(temp, Kokkos::ALL(), 3); + + // Compute norms for y0 and f0 + double n0 = KAT::zero(), n1 = KAT::zero(), dt0; + for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + scale(eqIdx) = atol + rtol * Kokkos::abs(y0(eqIdx)); + n0 += Kokkos::pow(y0(eqIdx) / scale(eqIdx), 2); + n1 += Kokkos::pow(f0(eqIdx) / scale(eqIdx), 2); + } + n0 = Kokkos::sqrt(n0) / Kokkos::sqrt(ode.neqs); + n1 = Kokkos::sqrt(n1) / Kokkos::sqrt(ode.neqs); + + // Select dt0 + if ((n0 < 1e-5) || (n1 < 1e-5)) { + dt0 = 1e-6; + } else { + dt0 = 0.01 * n0 / n1; + } + + // Estimate y at t0 + dt0 + for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + y1(eqIdx) = y0(eqIdx) + dt0 * f0(eqIdx); + } + + // Compute f at t0+dt0 and y1, + // then compute the norm of f(t0+dt0, y1) - f(t0, y0) + scalar_type n2 = KAT::zero(); + ode.evaluate_function(t0 + dt0, dt0, y1, f1); + for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + n2 += Kokkos::pow((f1(eqIdx) - f0(eqIdx)) / scale(eqIdx), 2); + } + n2 = Kokkos::sqrt(n2) / (dt0 * Kokkos::sqrt(ode.neqs)); + + // Finally select initial time step dt_ini + if ((n1 <= 1e-15) && (n2 <= 1e-15)) { + dt_ini = Kokkos::max(1e-6, dt0 * 1e-3); + } else { + dt_ini = Kokkos::pow(0.01 / Kokkos::max(n1, n2), KAT::one() / (order + 1)); + } + + dt_ini = Kokkos::min(100 * dt0, dt_ini); + + // Zero out temp variables just to be safe... 
+ for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + scale(eqIdx) = 0; + y1(eqIdx) = 0; + f1(eqIdx) = 0; + } +} // initial_step_size + +template +KOKKOS_FUNCTION void BDFStep(ode_type& ode, scalar_type& t, scalar_type& dt, + scalar_type t_end, int& order, + int& num_equal_steps, const int max_newton_iters, + const scalar_type atol, const scalar_type rtol, + const scalar_type min_factor, + const vec_type& y_old, const vec_type& y_new, + const res_type& rhs, const res_type& update, + const mat_type& temp, const mat_type& temp2) { + using newton_params = KokkosODE::Experimental::Newton_params; + + constexpr int max_order = 5; + + // For NDF coefficients see Shampine and Reichelt, The Matlab ODE suite, SIAM + // SISC 18, 1, p1-22, January 1997 Kokkos::Array kappa{{0., + // -0.1850, -1/9 , -0.0823000, -0.0415000, 0.}}; // NDF coefficients + // kappa gamma(i) = sum_{k=1}^i(1.0 / k); gamma(0) = 0; // NDF coefficients + // gamma_k alpha(i) = (1 - kappa(i)) * gamma(i) error_const(i) = kappa(i) * + // gamma(i) + 1 / (i + 1) + const Kokkos::Array alpha{ + {0., 1.185, 1.66666667, 1.98421667, 2.16979167, 2.28333333}}; + const Kokkos::Array error_const{ + {1., 0.315, 0.16666667, 0.09911667, 0.11354167, 0.16666667}}; + + // Extract columns of temp to form temporary + // subviews to operate on. 
+ // const int numRows = temp.extent_int(0); const int numCols = + // temp.extent_int(1); std::cout << "numRows: " << numRows << ", numCols: " << + // numCols << std::endl; std::cout << "Extract subview from temp" << + // std::endl; + int offset = 2; + auto D = Kokkos::subview( + temp, Kokkos::ALL(), + Kokkos::pair(offset, offset + 8)); // y and its derivatives + offset += 8; + auto tempD = Kokkos::subview(temp, Kokkos::ALL(), + Kokkos::pair(offset, offset + 8)); + offset += 8; + auto scale = Kokkos::subview(temp, Kokkos::ALL(), offset + 1); + ++offset; // Scaling coefficients for error calculation + auto y_predict = Kokkos::subview(temp, Kokkos::ALL(), offset + 1); + ++offset; // Initial guess for y_{n+1} + auto psi = Kokkos::subview(temp, Kokkos::ALL(), offset + 1); + ++offset; // Higher order terms contribution to rhs + auto error = Kokkos::subview(temp, Kokkos::ALL(), offset + 1); + ++offset; // Error estimate + auto jac = Kokkos::subview( + temp, Kokkos::ALL(), + Kokkos::pair(offset, offset + ode.neqs)); // Jacobian matrix + offset += ode.neqs; + auto tmp_gesv = Kokkos::subview( + temp, Kokkos::ALL(), + Kokkos::pair( + offset, offset + ode.neqs + 4)); // Buffer space for gesv calculation + offset += ode.neqs + 4; + + auto coeffs = + Kokkos::subview(temp2, Kokkos::ALL(), Kokkos::pair(0, 6)); + auto gamma = Kokkos::subview(temp2, Kokkos::ALL(), 6); + gamma(0) = 0.0; + gamma(1) = 1.0; + gamma(2) = 1.5; + gamma(3) = 1.83333333; + gamma(4) = 2.08333333; + gamma(5) = 2.28333333; + + BDF_system_wrapper2 sys(ode, psi, update, t, dt); + const newton_params param( + max_newton_iters, atol, + Kokkos::max(10 * Kokkos::ArithTraits::eps() / rtol, + Kokkos::min(0.03, Kokkos::sqrt(rtol)))); + + scalar_type max_step = Kokkos::ArithTraits::max(); + scalar_type min_step = Kokkos::ArithTraits::min(); + scalar_type safety = 0.675, error_norm; + if (dt > max_step) { + update_D(order, max_step / dt, coeffs, tempD, D); + dt = max_step; + num_equal_steps = 0; + } else if (dt < 
min_step) { + update_D(order, min_step / dt, coeffs, tempD, D); + dt = min_step; + num_equal_steps = 0; + } + + // first set y_new = y_old + for (int eqIdx = 0; eqIdx < sys.neqs; ++eqIdx) { + y_new(eqIdx) = y_old(eqIdx); + } + + double t_new = 0; + bool step_accepted = false; + while (!step_accepted) { + if (dt < min_step) { + return; + } + t_new = t + dt; + + if (t_new > t_end) { + t_new = t_end; + update_D(order, (t_new - t) / dt, coeffs, tempD, D); + num_equal_steps = 0; + } + dt = t_new - t; + + for (int eqIdx = 0; eqIdx < sys.neqs; ++eqIdx) { + y_predict(eqIdx) = 0; + for (int orderIdx = 0; orderIdx < order + 1; ++orderIdx) { + y_predict(eqIdx) += D(eqIdx, orderIdx); + } + scale(eqIdx) = atol + rtol * Kokkos::abs(y_predict(eqIdx)); + } + + // Compute psi, the sum of the higher order + // contribution to the residual + auto subD = + Kokkos::subview(D, Kokkos::ALL(), Kokkos::pair(1, order + 1)); + auto subGamma = + Kokkos::subview(gamma, Kokkos::pair(1, order + 1)); + KokkosBlas::Experimental::serial_gemv('N', 1.0 / alpha[order], subD, + subGamma, 0.0, psi); + + sys.compute_jac = true; + sys.c = dt / alpha[order]; + sys.jacobian(y_new, jac); + sys.compute_jac = true; + Kokkos::Experimental::local_deep_copy(y_new, y_predict); + Kokkos::Experimental::local_deep_copy(update, 0); + KokkosODE::Experimental::newton_solver_status newton_status = + KokkosODE::Experimental::Newton::Solve(sys, param, jac, tmp_gesv, y_new, + rhs, update, scale); + + for (int eqIdx = 0; eqIdx < sys.neqs; ++eqIdx) { + update(eqIdx) = y_new(eqIdx) - y_predict(eqIdx); + } + + if (newton_status == + KokkosODE::Experimental::newton_solver_status::MAX_ITER) { + dt = 0.5 * dt; + update_D(order, 0.5, coeffs, tempD, D); + num_equal_steps = 0; + + } else { + // Estimate the solution error + safety = 0.9 * (2 * max_newton_iters + 1) / + (2 * max_newton_iters + param.iters); + error_norm = 0; + for (int eqIdx = 0; eqIdx < sys.neqs; ++eqIdx) { + scale(eqIdx) = atol + rtol * Kokkos::abs(y_new(eqIdx)); + 
error(eqIdx) = error_const[order] * update(eqIdx) / scale(eqIdx); + error_norm += error(eqIdx) * error(eqIdx); + } + error_norm = Kokkos::sqrt(error_norm) / Kokkos::sqrt(sys.neqs); + + // Check error norm and adapt step size or accept step + if (error_norm > 1) { + scalar_type factor = Kokkos::max( + min_factor, safety * Kokkos::pow(error_norm, -1.0 / (order + 1))); + dt = factor * dt; + update_D(order, factor, coeffs, tempD, D); + num_equal_steps = 0; + } else { + step_accepted = true; + } + } + } // while(!step_accepted) + + // Now that our time step has been + // accepted we update all our states + // and see if we can adapt the order + // or the time step before going to + // the next step. + ++num_equal_steps; + t = t_new; + for (int eqIdx = 0; eqIdx < sys.neqs; ++eqIdx) { + D(eqIdx, order + 2) = update(eqIdx) - D(eqIdx, order + 1); + D(eqIdx, order + 1) = update(eqIdx); + for (int orderIdx = order; 0 <= orderIdx; --orderIdx) { + D(eqIdx, orderIdx) += D(eqIdx, orderIdx + 1); + } + } + + // Not enough steps at constant dt + // have been successful so we do not + // attempt order adaptation.
+ double error_low = 0, error_high = 0; + if (num_equal_steps < order + 1) { + return; + } + + if (1 < order) { + for (int eqIdx = 0; eqIdx < sys.neqs; ++eqIdx) { + error_low += Kokkos::pow( + error_const[order - 1] * D(eqIdx, order) / scale(eqIdx), 2); + } + error_low = Kokkos::sqrt(error_low) / Kokkos::sqrt(sys.neqs); + } else { + error_low = Kokkos::ArithTraits::max(); + } + + if (order < max_order) { + for (int eqIdx = 0; eqIdx < sys.neqs; ++eqIdx) { + error_high += Kokkos::pow( + error_const[order + 1] * D(eqIdx, order + 2) / scale(eqIdx), 2); + } + error_high = Kokkos::sqrt(error_high) / Kokkos::sqrt(sys.neqs); + } else { + error_high = Kokkos::ArithTraits::max(); + } + + double factor_low, factor_mid, factor_high, factor; + factor_low = Kokkos::pow(error_low, -1.0 / order); + factor_mid = Kokkos::pow(error_norm, -1.0 / (order + 1)); + factor_high = Kokkos::pow(error_high, -1.0 / (order + 2)); + + int delta_order = 0; + if ((factor_mid < factor_low) && (factor_high < factor_low)) { + delta_order = -1; + factor = factor_low; + } else if ((factor_low < factor_high) && (factor_mid < factor_high)) { + delta_order = 1; + factor = factor_high; + } else { + delta_order = 0; + factor = factor_mid; + } + order += delta_order; + factor = Kokkos::fmin(10, safety * factor); + dt *= factor; + + update_D(order, factor, coeffs, tempD, D); + num_equal_steps = 0; + +} // BDFStep + +} // namespace Impl +} // namespace KokkosODE + +#endif // KOKKOSBLAS_BDF_IMPL_HPP diff --git a/ode/impl/KokkosODE_Newton_impl.hpp b/ode/impl/KokkosODE_Newton_impl.hpp index d5000a74ab..348bf0aa22 100644 --- a/ode/impl/KokkosODE_Newton_impl.hpp +++ b/ode/impl/KokkosODE_Newton_impl.hpp @@ -30,18 +30,29 @@ namespace KokkosODE { namespace Impl { -template +template KOKKOS_FUNCTION KokkosODE::Experimental::newton_solver_status NewtonSolve( system_type& sys, const KokkosODE::Experimental::Newton_params& params, - mat_type& J, mat_type& tmp, vec_type& y0, vec_type& rhs, vec_type& update) { + mat_type& J, 
mat_type& tmp, ini_vec_type& y0, rhs_vec_type& rhs, + update_type& update, const scale_type& scale) { using newton_solver_status = KokkosODE::Experimental::newton_solver_status; - using value_type = typename vec_type::non_const_value_type; + using value_type = typename ini_vec_type::non_const_value_type; // Define the type returned by nrm2 to store // the norm of the residual. using norm_type = typename Kokkos::Details::InnerProductSpaceTraits< - typename vec_type::non_const_value_type>::mag_type; - norm_type norm = Kokkos::ArithTraits::zero(); + typename ini_vec_type::non_const_value_type>::mag_type; + sys.residual(y0, rhs); + const norm_type norm0 = KokkosBlas::serial_nrm2(rhs); + norm_type norm = Kokkos::ArithTraits::zero(); + norm_type norm_old = Kokkos::ArithTraits::zero(); + norm_type norm_new = Kokkos::ArithTraits::zero(); + norm_type rate = Kokkos::ArithTraits::zero(); + + const norm_type tol = + Kokkos::max(10 * Kokkos::ArithTraits::eps() / params.rel_tol, + Kokkos::min(0.03, Kokkos::sqrt(params.rel_tol))); // LBV - 07/24/2023: for now assume that we take // a full Newton step. Eventually this value can @@ -57,12 +68,6 @@ KOKKOS_FUNCTION KokkosODE::Experimental::newton_solver_status NewtonSolve( // Solve the following linearized // problem at each iteration: J*update=-rhs // with J=du/dx, rhs=f(u_n+update)-f(u_n) - norm = KokkosBlas::serial_nrm2(rhs); - - if ((norm < params.rel_tol) || - (it > 0 ? 
KokkosBlas::serial_nrm2(update) < params.abs_tol : false)) { - return newton_solver_status::NLS_SUCCESS; - } // compute LHS sys.jacobian(y0, J); @@ -73,6 +78,26 @@ KOKKOS_FUNCTION KokkosODE::Experimental::newton_solver_status NewtonSolve( J, update, rhs, tmp); KokkosBlas::SerialScale::invoke(-1, update); + // update solution // x = x + alpha*update + KokkosBlas::serial_axpy(alpha, update, y0); + norm = KokkosBlas::serial_nrm2(rhs); + + // Compute rms norm of the scaled update + for (int idx = 0; idx < sys.neqs; ++idx) { + norm_new = (update(idx) * update(idx)) / (scale(idx) * scale(idx)); + } + norm_new = Kokkos::sqrt(norm_new / sys.neqs); + if ((it > 0) && norm_old > Kokkos::ArithTraits::zero()) { + rate = norm_new / norm_old; + if ((rate >= 1) || + Kokkos::pow(rate, params.max_iters - it) / (1 - rate) * norm_new > + tol) { + return newton_solver_status::NLS_DIVERGENCE; + } else if ((norm_new == 0) || ((rate / (1 - rate)) * norm_new < tol)) { + return newton_solver_status::NLS_SUCCESS; + } + } + if (linSolverStat == 1) { #if KOKKOS_VERSION < 40199 KOKKOS_IMPL_DO_NOT_USE_PRINTF( @@ -83,8 +108,12 @@ KOKKOS_FUNCTION KokkosODE::Experimental::newton_solver_status NewtonSolve( return newton_solver_status::LIN_SOLVE_FAIL; } - // update solution // x = x + alpha*update - KokkosBlas::serial_axpy(alpha, update, y0); + if ((norm < (params.rel_tol * norm0)) || + (it > 0 ? KokkosBlas::serial_nrm2(update) < params.abs_tol : false)) { + return newton_solver_status::NLS_SUCCESS; + } + + norm_old = norm_new; } return newton_solver_status::MAX_ITER; } diff --git a/ode/src/KokkosODE_BDF.hpp b/ode/src/KokkosODE_BDF.hpp new file mode 100644 index 0000000000..71a450a1c6 --- /dev/null +++ b/ode/src/KokkosODE_BDF.hpp @@ -0,0 +1,227 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). 
+// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSODE_BDF_HPP +#define KOKKOSODE_BDF_HPP + +/// \author Luc Berger-Vergiat (lberge@sandia.gov) +/// \file KokkosODE_BDF.hpp + +#include "Kokkos_Core.hpp" +#include "KokkosODE_Types.hpp" +#include "KokkosODE_RungeKutta.hpp" + +#include "KokkosODE_BDF_impl.hpp" + +namespace KokkosODE { +namespace Experimental { + +enum BDF_type : int { + BDF1 = 0, + BDF2 = 1, + BDF3 = 2, + BDF4 = 3, + BDF5 = 4, + BDF6 = 5 +}; + +template +struct BDF_coeff_helper { + using table_type = void; + + BDF_coeff_helper() = default; +}; + +template <> +struct BDF_coeff_helper { + using table_type = KokkosODE::Impl::BDF_table<1>; + + BDF_coeff_helper() = default; +}; + +template <> +struct BDF_coeff_helper { + using table_type = KokkosODE::Impl::BDF_table<2>; + + BDF_coeff_helper() = default; +}; + +template <> +struct BDF_coeff_helper { + using table_type = KokkosODE::Impl::BDF_table<3>; + + BDF_coeff_helper() = default; +}; + +template <> +struct BDF_coeff_helper { + using table_type = KokkosODE::Impl::BDF_table<4>; + + BDF_coeff_helper() = default; +}; + +template <> +struct BDF_coeff_helper { + using table_type = KokkosODE::Impl::BDF_table<5>; + + BDF_coeff_helper() = default; +}; + +template <> +struct BDF_coeff_helper { + using table_type = KokkosODE::Impl::BDF_table<6>; + + BDF_coeff_helper() = default; +}; + +template +struct BDF { + using table_type = typename BDF_coeff_helper::table_type; + + template + KOKKOS_FUNCTION static void Solve( + const ode_type& ode, const scalar_type t_start, const scalar_type t_end, + const int num_steps, const vec_type& y0, const vec_type& y, + const vec_type& rhs, const vec_type& update, const vec_type& 
scale, + const mv_type& y_vecs, const mv_type& kstack, const mat_type& temp, + const mat_type& jac) { + const table_type table{}; + + const double dt = (t_end - t_start) / num_steps; + double t = t_start; + + // Load y0 into y_vecs(:, 0) + for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + y_vecs(eqIdx, 0) = y0(eqIdx); + } + + // Compute initial start-up history vectors + // Using a non adaptive explicit method. + const int init_steps = table.order - 1; + if (num_steps < init_steps) { + return; + } + KokkosODE::Experimental::ODE_params params(table.order - 1); + for (int stepIdx = 0; stepIdx < init_steps; ++stepIdx) { + KokkosODE::Experimental::RungeKutta::Solve( + ode, params, t, t + dt, y0, y, update, kstack); + + for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + y_vecs(eqIdx, stepIdx + 1) = y(eqIdx); + y0(eqIdx) = y(eqIdx); + } + t += dt; + } + + for (int stepIdx = init_steps; stepIdx < num_steps; ++stepIdx) { + KokkosODE::Impl::BDFStep(ode, table, t, dt, y0, y, rhs, update, scale, + y_vecs, temp, jac); + + // Update history + for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + y0(eqIdx) = y(eqIdx); + for (int orderIdx = 0; orderIdx < table.order - 1; ++orderIdx) { + y_vecs(eqIdx, orderIdx) = y_vecs(eqIdx, orderIdx + 1); + } + y_vecs(eqIdx, table.order - 1) = y(eqIdx); + } + t += dt; + } + } // Solve() +}; + +/// \brief BDF Solve integrates an ordinary differential equation +/// using an order and time adaptive BDF method. +/// +/// The integration starts with a BDF1 method and adaptively increases +/// or decreases both dt and the order of integration based on error +/// estimators. This function is marked as KOKKOS_FUNCTION so it can +/// be called on host and device. 
+/// +/// \tparam ode_type the type of the ode object to integrate +/// \tparam mv_type a rank-2 view +/// \tparam vec_type a rank-1 view +/// +/// \param ode [in]: the ode to integrate +/// \param t_start [in]: time at which the integration starts +/// \param t_end [in]: time at which the integration stops +/// \param initial_step [in]: initial value for dt +/// \param max_step [in]: maximum value for dt (currently unused) +/// \param y0 [in/out]: vector of initial conditions, set to the solution +/// at the end of the integration +/// \param y_new [out]: vector of solution at t_end +/// \param temp [in]: vectors for temporary storage +/// \param temp2 [in]: vectors for temporary storage +template +KOKKOS_FUNCTION void BDFSolve(const ode_type& ode, const scalar_type t_start, + const scalar_type t_end, + const scalar_type initial_step, + const scalar_type max_step, const vec_type& y0, + const vec_type& y_new, mat_type& temp, + mat_type& temp2) { + using KAT = Kokkos::ArithTraits; + + // This needs to go away and be pulled out of temp instead... + auto rhs = Kokkos::subview(temp, Kokkos::ALL(), 0); + auto update = Kokkos::subview(temp, Kokkos::ALL(), 1); + // vec_type rhs("rhs", ode.neqs), update("update", ode.neqs); + (void)max_step; + + int order = 1, num_equal_steps = 0; + constexpr scalar_type min_factor = 0.2; + scalar_type dt = initial_step; + scalar_type t = t_start; + + constexpr int max_newton_iters = 10; + scalar_type atol = 1.0e-6, rtol = 1.0e-3; + + // Compute rhs = f(t_start, y0) + ode.evaluate_function(t_start, 0, y0, rhs); + + // Check if we need to compute the initial + // time step size.
+ if (initial_step == KAT::zero()) { + KokkosODE::Impl::initial_step_size(ode, order, t_start, atol, rtol, y0, rhs, + temp, dt); + } + + // Initialize D(:, 0) = y0 and D(:, 1) = dt*rhs + auto D = Kokkos::subview(temp, Kokkos::ALL(), Kokkos::pair(2, 10)); + for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + D(eqIdx, 0) = y0(eqIdx); + D(eqIdx, 1) = dt * rhs(eqIdx); + rhs(eqIdx) = 0; + } + + // Now we loop over the time interval [t_start, t_end] + // and solve our ODE. + while (t < t_end) { + KokkosODE::Impl::BDFStep(ode, t, dt, t_end, order, num_equal_steps, + max_newton_iters, atol, rtol, min_factor, y0, + y_new, rhs, update, temp, temp2); + + for (int eqIdx = 0; eqIdx < ode.neqs; ++eqIdx) { + y0(eqIdx) = y_new(eqIdx); + } + // printf("t=%f, dt=%f, y={%f, %f, %f}\n", t, dt, y0(0), y0(1), y0(2)); + } +} // BDFSolve + +} // namespace Experimental +} // namespace KokkosODE + +#endif // KOKKOSODE_BDF_HPP diff --git a/ode/src/KokkosODE_Newton.hpp b/ode/src/KokkosODE_Newton.hpp index 94c96e2eea..ffccba5cd3 100644 --- a/ode/src/KokkosODE_Newton.hpp +++ b/ode/src/KokkosODE_Newton.hpp @@ -30,12 +30,14 @@ namespace Experimental { /// \brief Newton solver for non-linear system of equations struct Newton { - template + template KOKKOS_FUNCTION static newton_solver_status Solve( const system_type& sys, const Newton_params& params, const mat_type& J, - const mat_type& tmp, const vec_type& y0, const vec_type& rhs, - const vec_type& update) { - return KokkosODE::Impl::NewtonSolve(sys, params, J, tmp, y0, rhs, update); + const mat_type& tmp, const ini_vec_type& y0, const rhs_vec_type& rhs, + const update_type& update, const scale_type& scale) { + return KokkosODE::Impl::NewtonSolve(sys, params, J, tmp, y0, rhs, update, + scale); } }; diff --git a/ode/src/KokkosODE_Types.hpp b/ode/src/KokkosODE_Types.hpp index 7d78227526..5fb2c44846 100644 --- a/ode/src/KokkosODE_Types.hpp +++ b/ode/src/KokkosODE_Types.hpp @@ -54,16 +54,19 @@ struct ODE_params { enum newton_solver_status : int { 
NLS_SUCCESS = 0, MAX_ITER = 1, - LIN_SOLVE_FAIL = 2 + LIN_SOLVE_FAIL = 2, + NLS_DIVERGENCE = 3, }; struct Newton_params { - int max_iters; + int max_iters, iters = 0; double abs_tol, rel_tol; - // Constructor that only specify the desired number of steps. - // In this case no adaptivity is provided, the time step will - // be constant such that dt = (tend - tstart) / num_steps; + // Constructor that sets basic solver parameters + // used while solving the nonlinear system + // int max_iters_ [in]: maximum number of iterations allowed + // double abs_tol_ [in]: absolute tolerance to reach for successful solve + // double rel_tol_ [in]: relative tolerance to reach for successful solve KOKKOS_FUNCTION Newton_params(const int max_iters_, const double abs_tol_, const double rel_tol_) diff --git a/ode/unit_test/Test_ODE.hpp b/ode/unit_test/Test_ODE.hpp index 5d4861879b..1b55171581 100644 --- a/ode/unit_test/Test_ODE.hpp +++ b/ode/unit_test/Test_ODE.hpp @@ -22,5 +22,6 @@ // Implicit integrators #include "Test_ODE_Newton.hpp" +#include "Test_ODE_BDF.hpp" #endif // TEST_ODE_HPP diff --git a/ode/unit_test/Test_ODE_BDF.hpp b/ode/unit_test/Test_ODE_BDF.hpp new file mode 100644 index 0000000000..8360302971 --- /dev/null +++ b/ode/unit_test/Test_ODE_BDF.hpp @@ -0,0 +1,830 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#include +#include "KokkosKernels_TestUtils.hpp" + +#include "KokkosODE_BDF.hpp" + +namespace Test { + +// Logistic equation +// Used to model population growth +// it is a simple nonlinear ODE with +// a lot of literature. +// +// Equation: y'(t) = r*y(t)*(1-y(t)/K) +// Jacobian: df/dy = r - 2*r*y/K +// Solution: y = K / (1 + ((K - y0) / y0)*exp(-rt)) +struct Logistic { + static constexpr int neqs = 1; + + const double r, K; + + Logistic(double r_, double K_) : r(r_), K(K_){}; + + template + KOKKOS_FUNCTION void evaluate_function(const double /*t*/, + const double /*dt*/, + const vec_type1& y, + const vec_type2& f) const { + f(0) = r * y(0) * (1.0 - y(0) / K); + } + + template + KOKKOS_FUNCTION void evaluate_jacobian(const double /*t*/, + const double /*dt*/, const vec_type& y, + const mat_type& jac) const { + jac(0, 0) = r - 2 * r * y(0) / K; + } + + template + KOKKOS_FUNCTION void solution(const double t, const vec_type& y0, + const vec_type& y) const { + y(0) = K / (1 + (K - y0(0)) / y0(0) * Kokkos::exp(-r * t)); + } + +}; // Logistic + +// Lotka-Volterra equation +// A predator-prey model that describes +// population dynamics when two species +// interact.
+// +// Equations: y0'(t) = alpha*y0(t) - beta*y0(t)*y1(t) +// y1'(t) = delta*y0(t)*y1(t) - gamma*y1(t) +// Jacobian: df0/dy = [alpha-beta*y1(t); -beta*y0(t)] +// df1/dy = [delta*y1(t); delta*y0(t)-gamma] +// Solution: no simple closed-form expression; none is used in the tests +struct LotkaVolterra { + static constexpr int neqs = 2; + + const double alpha, beta, delta, gamma; + + LotkaVolterra(double alpha_, double beta_, double delta_, double gamma_) + : alpha(alpha_), beta(beta_), delta(delta_), gamma(gamma_){}; + + template + KOKKOS_FUNCTION void evaluate_function(const double /*t*/, + const double /*dt*/, + const vec_type1& y, + const vec_type2& f) const { + f(0) = alpha * y(0) - beta * y(0) * y(1); + f(1) = delta * y(0) * y(1) - gamma * y(1); + } + + template + KOKKOS_FUNCTION void evaluate_jacobian(const double /*t*/, + const double /*dt*/, const vec_type& y, + const mat_type& jac) const { + jac(0, 0) = alpha - beta * y(1); + jac(0, 1) = -beta * y(0); + jac(1, 0) = delta * y(1); + jac(1, 1) = delta * y(0) - gamma; + } + +}; // LotkaVolterra + +// Robertson's autocatalytic chemical reaction: +// H. H. Robertson, The solution of a set of reaction rate equations, +// in J. Walsh (Ed.), Numerical Analysis: An Introduction, +// pp. 178–182, Academic Press, London (1966).
+// +// Equations: y0' = -0.04*y0 + 1e4*y1*y2 +// y1' = 0.04*y0 - 1e4*y1*y2 - 3e7 * y1**2 +// y2' = 3e7 * y1**2 +struct StiffChemistry { + static constexpr int neqs = 3; + + StiffChemistry() {} + + template + KOKKOS_FUNCTION void evaluate_function(const double /*t*/, + const double /*dt*/, + const vec_type1& y, + const vec_type2& f) const { + f(0) = -0.04 * y(0) + 1.e4 * y(1) * y(2); + f(1) = 0.04 * y(0) - 1.e4 * y(1) * y(2) - 3.e7 * y(1) * y(1); + f(2) = 3.e7 * y(1) * y(1); + } + + template + KOKKOS_FUNCTION void evaluate_jacobian(const double /*t*/, + const double /*dt*/, const vec_type& y, + const mat_type& jac) const { + jac(0, 0) = -0.04; + jac(0, 1) = 1.e4 * y(2); + jac(0, 2) = 1.e4 * y(1); + jac(1, 0) = 0.04; + jac(1, 1) = -1.e4 * y(2) - 3.e7 * 2.0 * y(1); + jac(1, 2) = -1.e4 * y(1); + jac(2, 0) = 0.0; + jac(2, 1) = 3.e7 * 2.0 * y(1); + jac(2, 2) = 0.0; + } +}; + +template +struct BDFSolve_wrapper { + ode_type my_ode; + scalar_type tstart, tend; + int num_steps; + vec_type y_old, y_new, rhs, update, scale; + mv_type y_vecs, kstack; + mat_type temp, jac; + + BDFSolve_wrapper(const ode_type& my_ode_, const scalar_type tstart_, + const scalar_type tend_, const int num_steps_, + const vec_type& y_old_, const vec_type& y_new_, + const vec_type& rhs_, const vec_type& update_, + const vec_type& scale_, const mv_type& y_vecs_, + const mv_type& kstack_, const mat_type& temp_, + const mat_type& jac_) + : my_ode(my_ode_), + tstart(tstart_), + tend(tend_), + num_steps(num_steps_), + y_old(y_old_), + y_new(y_new_), + rhs(rhs_), + update(update_), + scale(scale_), + y_vecs(y_vecs_), + kstack(kstack_), + temp(temp_), + jac(jac_) {} + + KOKKOS_FUNCTION + void operator()(const int /*idx*/) const { + KokkosODE::Experimental::BDF::Solve( + my_ode, tstart, tend, num_steps, y_old, y_new, rhs, update, scale, + y_vecs, kstack, temp, jac); + } +}; + +template +struct BDF_Solve_wrapper { + const ode_type my_ode; + const scalar_type t_start, t_end, dt, max_step; + const vec_type
y0, y_new; + const mat_type temp, temp2; + + BDF_Solve_wrapper(const ode_type& my_ode_, const scalar_type& t_start_, + const scalar_type& t_end_, const scalar_type& dt_, + const scalar_type& max_step_, const vec_type& y0_, + const vec_type& y_new_, const mat_type& temp_, + const mat_type& temp2_) + : my_ode(my_ode_), + t_start(t_start_), + t_end(t_end_), + dt(dt_), + max_step(max_step_), + y0(y0_), + y_new(y_new_), + temp(temp_), + temp2(temp2_) {} + + KOKKOS_FUNCTION void operator()(const int) const { + KokkosODE::Experimental::BDFSolve(my_ode, t_start, t_end, dt, max_step, y0, + y_new, temp, temp2); + } +}; + +template +void test_BDF_Logistic() { + using execution_space = typename device_type::execution_space; + using vec_type = Kokkos::View; + using mv_type = Kokkos::View; + using mat_type = Kokkos::View; + + Kokkos::RangePolicy myPolicy(0, 1); + Logistic mySys(1, 1); + + constexpr int num_tests = 7; + int num_steps[num_tests] = {512, 256, 128, 64, 32, 16, 8}; + double errors[num_tests] = {0}; + const scalar_type t_start = 0.0, t_end = 6.0; + vec_type y0("initial conditions", mySys.neqs), y_new("solution", mySys.neqs); + vec_type rhs("rhs", mySys.neqs), update("update", mySys.neqs); + vec_type scale("scaling factors", mySys.neqs); + mat_type jac("jacobian", mySys.neqs, mySys.neqs), + temp("temp storage", mySys.neqs, mySys.neqs + 4); + mv_type kstack("Startup RK vectors", 6, mySys.neqs); + + Kokkos::deep_copy(scale, 1); + + scalar_type measured_order; + + // Test BDF1 +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "\nBDF1 convergence test" << std::endl; +#endif + for (int idx = 0; idx < num_tests; idx++) { + mv_type y_vecs("history vectors", mySys.neqs, 1); + + Kokkos::deep_copy(y0, 0.5); + Kokkos::deep_copy(y_vecs, 0.5); + + BDFSolve_wrapper + solve_wrapper(mySys, t_start, t_end, num_steps[idx], y0, y_new, rhs, + update, scale, y_vecs, kstack, temp, jac); + Kokkos::parallel_for(myPolicy, solve_wrapper); + Kokkos::fence(); + + auto y_new_h = 
Kokkos::create_mirror_view(y_new); + Kokkos::deep_copy(y_new_h, y_new); + + errors[idx] = Kokkos::abs(y_new_h(0) - 1 / (1 + Kokkos::exp(-t_end))) / + Kokkos::abs(1 / (1 + Kokkos::exp(-t_end))); + } + measured_order = + Kokkos::pow(errors[num_tests - 1] / errors[0], 1.0 / (num_tests - 1)); + EXPECT_NEAR_KK_REL(measured_order, 2.0, 0.15); +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "expected ratio: 2, actual ratio: " << measured_order + << ", order error=" << Kokkos::abs(measured_order - 2.0) / 2.0 + << std::endl; +#endif + + // Test BDF2 +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "\nBDF2 convergence test" << std::endl; +#endif + for (int idx = 0; idx < num_tests; idx++) { + mv_type y_vecs("history vectors", mySys.neqs, 2); + Kokkos::deep_copy(y0, 0.5); + + BDFSolve_wrapper + solve_wrapper(mySys, t_start, t_end, num_steps[idx], y0, y_new, rhs, + update, scale, y_vecs, kstack, temp, jac); + Kokkos::parallel_for(myPolicy, solve_wrapper); + Kokkos::fence(); + + auto y_new_h = Kokkos::create_mirror_view(y_new); + Kokkos::deep_copy(y_new_h, y_new); + + errors[idx] = Kokkos::abs(y_new_h(0) - 1 / (1 + Kokkos::exp(-t_end))) / + Kokkos::abs(1 / (1 + Kokkos::exp(-t_end))); + } + measured_order = + Kokkos::pow(errors[num_tests - 1] / errors[0], 1.0 / (num_tests - 1)); + EXPECT_NEAR_KK_REL(measured_order, 4.0, 0.15); +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "expected ratio: 4, actual ratio: " << measured_order + << ", order error=" << Kokkos::abs(measured_order - 4.0) / 4.0 + << std::endl; +#endif + + // Test BDF3 +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "\nBDF3 convergence test" << std::endl; +#endif + for (int idx = 0; idx < num_tests; idx++) { + mv_type y_vecs("history vectors", mySys.neqs, 3); + Kokkos::deep_copy(y0, 0.5); + + BDFSolve_wrapper + solve_wrapper(mySys, t_start, t_end, num_steps[idx], y0, y_new, rhs, + update, scale, y_vecs, kstack, temp, jac); + Kokkos::parallel_for(myPolicy, solve_wrapper); + Kokkos::fence(); + + 
auto y_new_h = Kokkos::create_mirror_view(y_new); + Kokkos::deep_copy(y_new_h, y_new); + + errors[idx] = Kokkos::abs(y_new_h(0) - 1 / (1 + Kokkos::exp(-t_end))) / + Kokkos::abs(1 / (1 + Kokkos::exp(-t_end))); + } + measured_order = + Kokkos::pow(errors[num_tests - 1] / errors[0], 1.0 / (num_tests - 1)); + EXPECT_NEAR_KK_REL(measured_order, 8.0, 0.15); +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "expected ratio: 8, actual ratio: " << measured_order + << ", order error=" << Kokkos::abs(measured_order - 8.0) / 8.0 + << std::endl; +#endif + + // Test BDF4 +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "\nBDF4 convergence test" << std::endl; +#endif + for (int idx = 0; idx < num_tests; idx++) { + mv_type y_vecs("history vectors", mySys.neqs, 4); + Kokkos::deep_copy(y0, 0.5); + + BDFSolve_wrapper + solve_wrapper(mySys, t_start, t_end, num_steps[idx], y0, y_new, rhs, + update, scale, y_vecs, kstack, temp, jac); + Kokkos::parallel_for(myPolicy, solve_wrapper); + Kokkos::fence(); + + auto y_new_h = Kokkos::create_mirror_view(y_new); + Kokkos::deep_copy(y_new_h, y_new); + + errors[idx] = Kokkos::abs(y_new_h(0) - 1 / (1 + Kokkos::exp(-t_end))) / + Kokkos::abs(1 / (1 + Kokkos::exp(-t_end))); + } + measured_order = + Kokkos::pow(errors[num_tests - 1] / errors[0], 1.0 / (num_tests - 1)); +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "expected ratio: 16, actual ratio: " << measured_order + << ", order error=" << Kokkos::abs(measured_order - 16.0) / 16.0 + << std::endl; +#endif + + // Test BDF5 +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "\nBDF5 convergence test" << std::endl; +#endif + for (int idx = 0; idx < num_tests; idx++) { + mv_type y_vecs("history vectors", mySys.neqs, 5); + Kokkos::deep_copy(y0, 0.5); + + BDFSolve_wrapper + solve_wrapper(mySys, t_start, t_end, num_steps[idx], y0, y_new, rhs, + update, scale, y_vecs, kstack, temp, jac); + Kokkos::parallel_for(myPolicy, solve_wrapper); + Kokkos::fence(); + + auto y_new_h = 
Kokkos::create_mirror_view(y_new); + Kokkos::deep_copy(y_new_h, y_new); + + errors[idx] = Kokkos::abs(y_new_h(0) - 1 / (1 + Kokkos::exp(-t_end))) / + Kokkos::abs(1 / (1 + Kokkos::exp(-t_end))); + } + measured_order = + Kokkos::pow(errors[num_tests - 1] / errors[0], 1.0 / (num_tests - 1)); +#if defined(HAVE_KOKKOSKERNELS_DEBUG) + std::cout << "expected ratio: 32, actual ratio: " << measured_order + << ", order error=" << Kokkos::abs(measured_order - 32.0) / 32.0 + << std::endl; +#endif + +} // test_BDF_Logistic + +template +void test_BDF_LotkaVolterra() { + using execution_space = typename device_type::execution_space; + using vec_type = Kokkos::View; + using mv_type = Kokkos::View; + using mat_type = Kokkos::View; + + LotkaVolterra mySys(1.1, 0.4, 0.1, 0.4); + + const scalar_type t_start = 0.0, t_end = 100.0; + vec_type y0("initial conditions", mySys.neqs), y_new("solution", mySys.neqs); + vec_type rhs("rhs", mySys.neqs), update("update", mySys.neqs); + vec_type scale("scaling factors", mySys.neqs); + mat_type jac("jacobian", mySys.neqs, mySys.neqs), + temp("temp storage", mySys.neqs, mySys.neqs + 4); + + Kokkos::deep_copy(scale, 1); + + // Test BDF5 + mv_type kstack("Startup RK vectors", 6, mySys.neqs); + mv_type y_vecs("history vectors", mySys.neqs, 5); + + Kokkos::deep_copy(y0, 10.0); + Kokkos::deep_copy(y_vecs, 10.0); + + Kokkos::RangePolicy myPolicy(0, 1); + BDFSolve_wrapper + solve_wrapper(mySys, t_start, t_end, 1000, y0, y_new, rhs, update, scale, + y_vecs, kstack, temp, jac); + Kokkos::parallel_for(myPolicy, solve_wrapper); +} + +template +void test_BDF_StiffChemistry() { + using execution_space = typename device_type::execution_space; + using vec_type = Kokkos::View; + using mv_type = Kokkos::View; + using mat_type = Kokkos::View; + + StiffChemistry mySys{}; + + const scalar_type t_start = 0.0, t_end = 500.0; + vec_type y0("initial conditions", mySys.neqs), y_new("solution", mySys.neqs); + vec_type rhs("rhs", mySys.neqs), update("update", mySys.neqs); + 
vec_type scale("scaling factors", mySys.neqs); + mat_type jac("jacobian", mySys.neqs, mySys.neqs), + temp("temp storage", mySys.neqs, mySys.neqs + 4); + + Kokkos::deep_copy(scale, 1); + + // Test BDF5 + mv_type kstack("Startup RK vectors", 6, mySys.neqs); + mv_type y_vecs("history vectors", mySys.neqs, 5); + + auto y0_h = Kokkos::create_mirror_view(y0); + y0_h(0) = 1.0; + y0_h(1) = 0.0; + y0_h(2) = 0.0; + Kokkos::deep_copy(y0, y0_h); + Kokkos::deep_copy(y_vecs, 0.0); + + Kokkos::RangePolicy myPolicy(0, 1); + BDFSolve_wrapper + solve_wrapper(mySys, t_start, t_end, 110000, y0, y_new, rhs, update, + scale, y_vecs, kstack, temp, jac); + Kokkos::parallel_for(myPolicy, solve_wrapper); +} + +// template +// struct BDFSolve_parallel { +// ode_type my_ode; +// scalar_type tstart, tend; +// int num_steps; +// vec_type y_old, y_new, rhs, update, scale; +// mv_type y_vecs, kstack; +// mat_type temp, jac; + +// BDFSolve_parallel(const ode_type& my_ode_, const scalar_type tstart_, +// const scalar_type tend_, const int num_steps_, +// const vec_type& y_old_, const vec_type& y_new_, +// const vec_type& rhs_, const vec_type& update_, +// const vec_type& scale_, +// const mv_type& y_vecs_, const mv_type& kstack_, +// const mat_type& temp_, const mat_type& jac_) +// : my_ode(my_ode_), +// tstart(tstart_), +// tend(tend_), +// num_steps(num_steps_), +// y_old(y_old_), +// y_new(y_new_), +// rhs(rhs_), +// update(update_), +// scale(scale_), +// y_vecs(y_vecs_), +// kstack(kstack_), +// temp(temp_), +// jac(jac_) {} + +// KOKKOS_FUNCTION +// void operator()(const int idx) const { +// auto local_y_old = Kokkos::subview( +// y_old, +// Kokkos::pair(my_ode.neqs * idx, my_ode.neqs * (idx + 1))); +// auto local_y_new = Kokkos::subview( +// y_new, +// Kokkos::pair(my_ode.neqs * idx, my_ode.neqs * (idx + 1))); +// auto local_rhs = Kokkos::subview( +// rhs, +// Kokkos::pair(my_ode.neqs * idx, my_ode.neqs * (idx + 1))); +// auto local_update = Kokkos::subview( +// update, +// 
Kokkos::pair(my_ode.neqs * idx, my_ode.neqs * (idx + 1))); + +// auto local_y_vecs = Kokkos::subview( +// y_vecs, +// Kokkos::pair(my_ode.neqs * idx, my_ode.neqs * (idx + 1)), +// Kokkos::ALL()); +// auto local_kstack = Kokkos::subview( +// kstack, Kokkos::ALL(), +// Kokkos::pair(my_ode.neqs * idx, my_ode.neqs * (idx + 1))); +// auto local_temp = Kokkos::subview( +// temp, +// Kokkos::pair(my_ode.neqs * idx, my_ode.neqs * (idx + 1)), +// Kokkos::ALL()); +// auto local_jac = Kokkos::subview( +// jac, Kokkos::pair(my_ode.neqs * idx, my_ode.neqs * (idx + +// 1)), Kokkos::ALL()); + +// KokkosODE::Experimental::BDF::Solve( +// my_ode, tstart, tend, num_steps, local_y_old, local_y_new, local_rhs, +// local_update, scale, local_y_vecs, local_kstack, local_temp, +// local_jac); +// } +// }; + +// template +// void test_BDF_parallel() { +// using execution_space = typename device_type::execution_space; +// using vec_type = Kokkos::View; +// using mv_type = Kokkos::View; +// using mat_type = Kokkos::View; + +// LotkaVolterra mySys(1.1, 0.4, 0.1, 0.4); +// constexpr int num_solves = 1000; + +// vec_type scale("scaling factors", mySys.neqs); +// Kokkos::deep_copy(scale, 1); + +// const scalar_type t_start = 0.0, t_end = 100.0; +// vec_type y0("initial conditions", mySys.neqs * num_solves), +// y_new("solution", mySys.neqs * num_solves); +// vec_type rhs("rhs", mySys.neqs * num_solves), +// update("update", mySys.neqs * num_solves); +// mat_type jac("jacobian", mySys.neqs * num_solves, mySys.neqs), +// temp("temp storage", mySys.neqs * num_solves, mySys.neqs + 4); + +// // Test BDF5 +// mv_type y_vecs("history vectors", mySys.neqs * num_solves, 5), +// kstack("Startup RK vectors", 6, mySys.neqs * num_solves); + +// Kokkos::deep_copy(y0, 10.0); +// Kokkos::deep_copy(y_vecs, 10.0); + +// Kokkos::RangePolicy myPolicy(0, num_solves); +// BDFSolve_parallel +// solve_wrapper(mySys, t_start, t_end, 1000, y0, y_new, rhs, update, +// scale, y_vecs, +// kstack, temp, jac); +// 
Kokkos::parallel_for(myPolicy, solve_wrapper); + +// Kokkos::fence(); +// } + +template +void compute_coeffs(const int order, const scalar_type factor, + const mat_type& coeffs) { + std::cout << "compute_coeffs" << std::endl; + + coeffs(0, 0) = 1.0; + for (int colIdx = 0; colIdx < order; ++colIdx) { + coeffs(0, colIdx + 1) = 1.0; + for (int rowIdx = 0; rowIdx < order; ++rowIdx) { + coeffs(rowIdx + 1, colIdx + 1) = + ((rowIdx - factor * (colIdx + 1.0)) / (rowIdx + 1.0)) * + coeffs(rowIdx, colIdx + 1); + } + } +} + +template +void update_D(const int order, const scalar_type factor, const mat_type& coeffs, + const mat_type& tempD, const mat_type& D) { + auto subD = + Kokkos::subview(D, Kokkos::pair(0, order + 1), Kokkos::ALL); + auto subTempD = + Kokkos::subview(tempD, Kokkos::pair(0, order + 1), Kokkos::ALL); + + compute_coeffs(order, factor, coeffs); + auto R = Kokkos::subview(coeffs, Kokkos::pair(0, order + 1), + Kokkos::pair(0, order + 1)); + std::cout << "SerialGemm" << std::endl; + KokkosBatched::SerialGemm< + KokkosBatched::Trans::Transpose, KokkosBatched::Trans::NoTranspose, + KokkosBatched::Algo::Gemm::Blocked>::invoke(1.0, R, subD, 0.0, subTempD); + + compute_coeffs(order, 1.0, coeffs); + auto U = Kokkos::subview(coeffs, Kokkos::pair(0, order + 1), + Kokkos::pair(0, order + 1)); + std::cout << "SerialGemm" << std::endl; + KokkosBatched::SerialGemm< + KokkosBatched::Trans::Transpose, KokkosBatched::Trans::NoTranspose, + KokkosBatched::Algo::Gemm::Blocked>::invoke(1.0, U, subTempD, 0.0, subD); +} + +template +void test_Nordsieck() { + using execution_space = Kokkos::HostSpace; + StiffChemistry mySys{}; + + Kokkos::View R("coeffs", 6, 6), + U("coeffs", 6, 6); + Kokkos::View D("D", 8, mySys.neqs), + tempD("tmp", 8, mySys.neqs); + int order = 1; + double factor = 0.8; + + constexpr double t_start = 0.0, t_end = 500.0; + int max_steps = 200000; + double dt = (t_end - t_start) / max_steps; + + auto y0 = Kokkos::subview(D, 0, Kokkos::ALL()); + auto f = 
Kokkos::subview(D, 1, Kokkos::ALL()); + y0(0) = 1.0; + + mySys.evaluate_function(0, 0, y0, f); + for (int eqIdx = 0; eqIdx < mySys.neqs; ++eqIdx) { + f(eqIdx) *= dt; + } + + compute_coeffs(order, factor, R); + compute_coeffs(order, 1.0, U); + + { + std::cout << "R: " << std::endl; + for (int i = 0; i < order + 1; ++i) { + std::cout << "{ "; + for (int j = 0; j < order + 1; ++j) { + std::cout << R(i, j) << ", "; + } + std::cout << "}" << std::endl; + } + } + + std::cout << "D before update:" << std::endl; + std::cout << " { " << D(0, 0) << ", " << D(0, 1) << ", " << D(0, 2) << " }" + << std::endl; + std::cout << " { " << D(1, 0) << ", " << D(1, 1) << ", " << D(1, 2) << " }" + << std::endl; + update_D(order, factor, R, tempD, D); + + std::cout << "D after update:" << std::endl; + std::cout << " { " << D(0, 0) << ", " << D(0, 1) << ", " << D(0, 2) << " }" + << std::endl; + std::cout << " { " << D(1, 0) << ", " << D(1, 1) << ", " << D(1, 2) << " }" + << std::endl; +} + +template +void test_adaptive_BDF() { + using execution_space = typename device_type::execution_space; + using vec_type = Kokkos::View; + using mat_type = Kokkos::View; + + Logistic mySys(1, 1); + + constexpr double t_start = 0.0, t_end = 6.0, atol = 1.0e-6, rtol = 1.0e-4; + constexpr int num_steps = 512, max_newton_iters = 5; + int order = 1, num_equal_steps = 0; + double dt = (t_end - t_start) / num_steps; + double t = t_start; + + vec_type y0("initial conditions", mySys.neqs), y_new("solution", mySys.neqs); + vec_type rhs("rhs", mySys.neqs), update("update", mySys.neqs); + mat_type temp("buffer1", mySys.neqs, 23 + 2 * mySys.neqs + 4), + temp2("buffer2", 6, 7); + + // Initial condition + Kokkos::deep_copy(y0, 0.5); + + // Initialize D + auto D = Kokkos::subview(temp, Kokkos::ALL(), Kokkos::pair(2, 10)); + D(0, 0) = y0(0); + mySys.evaluate_function(0, 0, y0, rhs); + D(0, 1) = dt * rhs(0); + Kokkos::deep_copy(rhs, 0); + + std::cout << "**********************\n" + << " Step 1\n" + << 
"**********************" << std::endl; + + std::cout << "Initial conditions" << std::endl; + std::cout << " y0=" << y0(0) << ", t=" << t << ", dt=" << dt << std::endl; + + std::cout << "Initial D: {" << D(0, 0) << ", " << D(0, 1) << ", " << D(0, 2) + << ", " << D(0, 3) << ", " << D(0, 4) << ", " << D(0, 5) << ", " + << D(0, 6) << ", " << D(0, 7) << "}" << std::endl; + + KokkosODE::Impl::BDFStep(mySys, t, dt, t_end, order, num_equal_steps, + max_newton_iters, atol, rtol, 0.2, y0, y_new, rhs, + update, temp, temp2); + + for (int eqIdx = 0; eqIdx < mySys.neqs; ++eqIdx) { + y0(eqIdx) = y_new(eqIdx); + } + + std::cout << "**********************\n" + << " Step 2\n" + << "**********************" << std::endl; + + std::cout << " y0=" << y0(0) << ", t=" << t << ", dt: " << dt << std::endl; + + std::cout << "Initial D: {" << D(0, 0) << ", " << D(0, 1) << ", " << D(0, 2) + << ", " << D(0, 3) << ", " << D(0, 4) << ", " << D(0, 5) << ", " + << D(0, 6) << ", " << D(0, 7) << "}" << std::endl; + + KokkosODE::Impl::BDFStep(mySys, t, dt, t_end, order, num_equal_steps, + max_newton_iters, atol, rtol, 0.2, y0, y_new, rhs, + update, temp, temp2); + + for (int eqIdx = 0; eqIdx < mySys.neqs; ++eqIdx) { + y0(eqIdx) = y_new(eqIdx); + } + + std::cout << "**********************\n" + << " Step 3\n" + << "**********************" << std::endl; + + std::cout << " y0=" << y0(0) << ", t=" << t << ", dt: " << dt << std::endl; + + std::cout << "Initial D: {" << D(0, 0) << ", " << D(0, 1) << ", " << D(0, 2) + << ", " << D(0, 3) << ", " << D(0, 4) << ", " << D(0, 5) << ", " + << D(0, 6) << ", " << D(0, 7) << "}" << std::endl; + + KokkosODE::Impl::BDFStep(mySys, t, dt, t_end, order, num_equal_steps, + max_newton_iters, atol, rtol, 0.2, y0, y_new, rhs, + update, temp, temp2); + + std::cout << "Final t: " << t << ", y=" << y_new(0) << std::endl; + +} // test_adaptive_BDF() + +template +void test_adaptive_BDF_v2() { + using vec_type = Kokkos::View; + using mat_type = Kokkos::View; + using KAT = 
Kokkos::ArithTraits; + + std::cout << "\n\n\nBDF_v2 test starting\n" << std::endl; + + Logistic mySys(1, 1); + + const scalar_type t_start = KAT::zero(), + t_end = 6 * KAT::one(); //, atol = 1.0e-6, rtol = 1.0e-4; + vec_type y0("initial conditions", mySys.neqs), y_new("solution", mySys.neqs); + Kokkos::deep_copy(y0, 0.5); + + mat_type temp("buffer1", mySys.neqs, 23 + 2 * mySys.neqs + 4), + temp2("buffer2", 6, 7); + + { + scalar_type dt = KAT::zero(); + vec_type f0("initial value f", mySys.neqs); + mySys.evaluate_function(t_start, KAT::zero(), y0, f0); + KokkosODE::Impl::initial_step_size(mySys, 1, t_start, 1e-6, 1e-3, y0, f0, + temp, dt); + + std::cout << "Initial Step Size: dt=" << dt << std::endl; + } + + KokkosODE::Experimental::BDFSolve(mySys, t_start, t_end, 0.0117188, + (t_end - t_start) / 10, y0, y_new, temp, + temp2); +} + +template +void test_BDF_adaptive_stiff() { + using execution_space = typename Device::execution_space; + using vec_type = Kokkos::View; + using mat_type = Kokkos::View; + using KAT = Kokkos::ArithTraits; + + StiffChemistry mySys{}; + + const scalar_type t_start = KAT::zero(), t_end = 350 * KAT::one(); + scalar_type dt = KAT::zero(); + vec_type y0("initial conditions", mySys.neqs), y_new("solution", mySys.neqs); + + // Set initial conditions + auto y0_h = Kokkos::create_mirror_view(y0); + y0_h(0) = KAT::one(); + y0_h(1) = KAT::zero(); + y0_h(2) = KAT::zero(); + Kokkos::deep_copy(y0, y0_h); + + mat_type temp("buffer1", mySys.neqs, 23 + 2 * mySys.neqs + 4), + temp2("buffer2", 6, 7); + + Kokkos::RangePolicy policy(0, 1); + BDF_Solve_wrapper bdf_wrapper(mySys, t_start, t_end, dt, + (t_end - t_start) / 10, y0, y_new, temp, temp2); + + Kokkos::parallel_for(policy, bdf_wrapper); + + auto y_new_h = Kokkos::create_mirror_view(y_new); + Kokkos::deep_copy(y_new_h, y_new); + std::cout << "Stiff Chemistry solution at t=350: {" << y_new_h(0) << ", " + << y_new_h(1) << ", " << y_new_h(2) << "}" << std::endl; +} + +} // namespace Test + 
+TEST_F(TestCategory, BDF_Logistic_serial) { + ::Test::test_BDF_Logistic(); +} +TEST_F(TestCategory, BDF_LotkaVolterra_serial) { + ::Test::test_BDF_LotkaVolterra(); +} +TEST_F(TestCategory, BDF_StiffChemistry_serial) { + ::Test::test_BDF_StiffChemistry(); +} +// TEST_F(TestCategory, BDF_parallel_serial) { +// ::Test::test_BDF_parallel(); +// } +TEST_F(TestCategory, BDF_Nordsieck) { + ::Test::test_Nordsieck(); +} +// TEST_F(TestCategory, BDF_adaptive) { +// ::Test::test_adaptive_BDF(); +// ::Test::test_adaptive_BDF_v2(); +// } +TEST_F(TestCategory, BDF_StiffChemistry_adaptive) { + ::Test::test_BDF_adaptive_stiff(); +} diff --git a/ode/unit_test/Test_ODE_Newton.hpp b/ode/unit_test/Test_ODE_Newton.hpp index d235df1e56..45dd4adb6a 100644 --- a/ode/unit_test/Test_ODE_Newton.hpp +++ b/ode/unit_test/Test_ODE_Newton.hpp @@ -21,7 +21,8 @@ namespace Test { -template +template struct NewtonSolve_wrapper { using newton_params = KokkosODE::Experimental::Newton_params; @@ -32,10 +33,13 @@ struct NewtonSolve_wrapper { mat_type J, tmp; status_view status; + scale_type scale; + NewtonSolve_wrapper(const system_type& my_nls_, const newton_params& params_, const vec_type& x_, const vec_type& rhs_, const vec_type& update_, const mat_type& J_, - const mat_type& tmp_, const status_view& status_) + const mat_type& tmp_, const status_view& status_, + const scale_type& scale_) : my_nls(my_nls_), params(params_), x(x_), @@ -43,7 +47,8 @@ struct NewtonSolve_wrapper { update(update_), J(J_), tmp(tmp_), - status(status_) {} + status(status_), + scale(scale_) {} KOKKOS_FUNCTION void operator()(const int idx) const { @@ -71,7 +76,8 @@ struct NewtonSolve_wrapper { // Run Newton nonlinear solver status(idx) = KokkosODE::Experimental::Newton::Solve( - my_nls, params, local_J, local_tmp, local_x, local_rhs, local_update); + my_nls, params, local_J, local_tmp, local_x, local_rhs, local_update, + scale); } }; @@ -87,6 +93,9 @@ void run_newton_test(const system_type& mySys, Kokkos::View status("Newton 
status", 1); + vec_type scale("scaling factors", mySys.neqs); + Kokkos::deep_copy(scale, 1); + vec_type x("solution vector", mySys.neqs), rhs("right hand side vector", mySys.neqs); auto x_h = Kokkos::create_mirror_view(x); @@ -104,7 +113,7 @@ void run_newton_test(const system_type& mySys, Kokkos::RangePolicy my_policy(0, 1); NewtonSolve_wrapper solve_wrapper(mySys, params, x, rhs, update, J, tmp, - status); + status, scale); Kokkos::parallel_for(my_policy, solve_wrapper); @@ -205,6 +214,9 @@ void test_newton_status() { using vec_type = typename Kokkos::View; using mat_type = typename Kokkos::View; + vec_type scale("scaling factors", 1); + Kokkos::deep_copy(scale, 1); + double abs_tol, rel_tol; if (std::is_same_v) { rel_tol = 10e-5; @@ -227,7 +239,7 @@ void test_newton_status() { scalar_type solution[3] = {2.0, -1.0, 0.0}; #endif newton_solver_status newton_status[3] = { - newton_solver_status::NLS_SUCCESS, newton_solver_status::MAX_ITER, + newton_solver_status::NLS_SUCCESS, newton_solver_status::NLS_DIVERGENCE, newton_solver_status::LIN_SOLVE_FAIL}; vec_type x("solution vector", 1), rhs("right hand side vector", 1); auto x_h = Kokkos::create_mirror_view(x); @@ -242,7 +254,7 @@ void test_newton_status() { Kokkos::RangePolicy my_policy(0, 1); NewtonSolve_wrapper solve_wrapper(my_system, params, x, rhs, update, J, tmp, - status); + status, scale); Kokkos::parallel_for(my_policy, solve_wrapper); Kokkos::deep_copy(status_h, status); @@ -481,6 +493,9 @@ void test_newton_on_device() { system_type mySys{}; + vec_type scale("scaling factors", mySys.neqs); + Kokkos::deep_copy(scale, 1); + vec_type x("solution vector", mySys.neqs * num_systems); vec_type rhs("right hand side vector", mySys.neqs * num_systems); vec_type update("update", mySys.neqs * num_systems); @@ -503,7 +518,7 @@ void test_newton_on_device() { Kokkos::RangePolicy my_policy(0, num_systems); NewtonSolve_wrapper solve_wrapper(mySys, params, x, rhs, update, J, tmp, - status); + status, scale); 
Kokkos::parallel_for(my_policy, solve_wrapper); Kokkos::fence(); diff --git a/perf_test/CMakeLists.txt b/perf_test/CMakeLists.txt index cf1905d6d4..28271dfb0d 100644 --- a/perf_test/CMakeLists.txt +++ b/perf_test/CMakeLists.txt @@ -49,6 +49,7 @@ if (KokkosKernels_ENABLE_PERFTESTS) ADD_COMPONENT_SUBDIRECTORY(sparse) ADD_COMPONENT_SUBDIRECTORY(blas) ADD_COMPONENT_SUBDIRECTORY(ode) + ADD_COMPONENT_SUBDIRECTORY(lapack) ADD_SUBDIRECTORY(performance) #ADD_SUBDIRECTORY(common) diff --git a/perf_test/blas/blas3/KokkosBlas3_gemm_standalone_perf_test_benchmark.cpp b/perf_test/blas/blas3/KokkosBlas3_gemm_standalone_perf_test_benchmark.cpp index d617ffcdf3..cd7f194071 100644 --- a/perf_test/blas/blas3/KokkosBlas3_gemm_standalone_perf_test_benchmark.cpp +++ b/perf_test/blas/blas3/KokkosBlas3_gemm_standalone_perf_test_benchmark.cpp @@ -142,7 +142,19 @@ void run(const blas3_gemm_params& params) { int main(int argc, char** argv) { const auto params = blas3_gemm_params::get_params(argc, argv); const int num_threads = params.use_openmp; - const int device_id = params.use_cuda - 1; + + // the common parameter parser takes the requested device ID and + // adds 1 to it (e.g. --cuda 0 -> params.use_cuda = 1) + // this is presumably so that 0 can be a sentinel value, + // even though device ID 0 is valid + // here, we use CUDA, SYCL, or HIP, whichever is set first, to + // choose which device Kokkos should initialize on + // or -1, for no such selection + const int device_id = + params.use_cuda + ? params.use_cuda - 1 + : (params.use_sycl ? params.use_sycl - 1 + : (params.use_hip ? 
params.use_hip - 1 : -1)); Kokkos::initialize(Kokkos::InitializationSettings() .set_num_threads(num_threads) diff --git a/perf_test/lapack/CMakeLists.txt b/perf_test/lapack/CMakeLists.txt new file mode 100644 index 0000000000..478703d38a --- /dev/null +++ b/perf_test/lapack/CMakeLists.txt @@ -0,0 +1,8 @@ +KOKKOSKERNELS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR}) +KOKKOSKERNELS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}) + +if(KOKKOSKERNELS_ENABLE_BENCHMARK) + KOKKOSKERNELS_ADD_BENCHMARK( + lapack_svd SOURCES KokkosLapack_SVD_benchmark.cpp + ) +endif() diff --git a/perf_test/lapack/KokkosLapack_SVD_benchmark.cpp b/perf_test/lapack/KokkosLapack_SVD_benchmark.cpp new file mode 100644 index 0000000000..1ac9381ff8 --- /dev/null +++ b/perf_test/lapack/KokkosLapack_SVD_benchmark.cpp @@ -0,0 +1,124 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#include "KokkosLapack_svd.hpp" + +#include "KokkosKernels_TestUtils.hpp" +#include "KokkosKernels_perf_test_utilities.hpp" + +#include +#include "Benchmark_Context.hpp" + +struct svd_parameters { + int numRows, numCols; + bool verbose; + + svd_parameters(const int numRows_, const int numCols_, const bool verbose_) + : numRows(numRows_), numCols(numCols_), verbose(verbose_){}; +}; + +void print_options() { + std::cerr << "Options\n" << std::endl; + + std::cerr << perf_test::list_common_options(); + + std::cerr << "\t[Optional] --verbose :: enable verbose output" + << std::endl; + std::cerr << "\t[Optional] --m :: number of rows of A" << std::endl; + std::cerr << "\t[Optional] --n :: number of columns of A" + << std::endl; +} // print_options + +int parse_inputs(svd_parameters& params, int argc, char** argv) { + for (int i = 1; i < argc; ++i) { + if (perf_test::check_arg_int(i, argc, argv, "--m", params.numRows)) { + ++i; + } else if (perf_test::check_arg_int(i, argc, argv, "--n", params.numCols)) { + ++i; + } else if (perf_test::check_arg_bool(i, argc, argv, "--verbose", + params.verbose)) { + } else { + std::cerr << "Unrecognized command line argument #" << i << ": " + << argv[i] << std::endl; + print_options(); + return 1; + } + } + return 0; +} // parse_inputs + +template +void run_svd_benchmark(benchmark::State& state, + const svd_parameters& svd_params) { + using mat_type = Kokkos::View; + using vec_type = Kokkos::View; + + const int m = svd_params.numRows; + const int n = svd_params.numCols; + + mat_type A("A", m, n), U("U", m, m), Vt("Vt", n, n); + vec_type S("S", Kokkos::min(m, n)); + + const uint64_t seed = + std::chrono::high_resolution_clock::now().time_since_epoch().count(); + Kokkos::Random_XorShift64_Pool rand_pool(seed); + + // Initialize A with random numbers + double randStart = 0, randEnd = 0; + Test::getRandomBounds(10.0, randStart, randEnd); + Kokkos::fill_random(A, 
rand_pool, randStart, randEnd); + + for (auto _ : state) { + (void)_; + KokkosLapack::svd("A", "A", A, S, U, Vt); + Kokkos::fence(); + } +} + +int main(int argc, char** argv) { + Kokkos::initialize(argc, argv); + + benchmark::Initialize(&argc, argv); + benchmark::SetDefaultTimeUnit(benchmark::kMillisecond); + KokkosKernelsBenchmark::add_benchmark_context(true); + + perf_test::CommonInputParams common_params; + perf_test::parse_common_options(argc, argv, common_params); + svd_parameters svd_params(0, 0, false); + parse_inputs(svd_params, argc, argv); + + std::string bench_name = "KokkosLapack_SVD"; + + if (0 < common_params.repeat) { + benchmark::RegisterBenchmark( + bench_name.c_str(), run_svd_benchmark, + svd_params) + ->UseRealTime() + ->Iterations(common_params.repeat); + } else { + benchmark::RegisterBenchmark( + bench_name.c_str(), run_svd_benchmark, + svd_params) + ->UseRealTime(); + } + + benchmark::RunSpecifiedBenchmarks(); + + benchmark::Shutdown(); + Kokkos::finalize(); + + return 0; +} diff --git a/perf_test/ode/CMakeLists.txt b/perf_test/ode/CMakeLists.txt index b4aa86889f..39acabed98 100644 --- a/perf_test/ode/CMakeLists.txt +++ b/perf_test/ode/CMakeLists.txt @@ -5,4 +5,8 @@ if(KOKKOSKERNELS_ENABLE_BENCHMARK) KOKKOSKERNELS_ADD_BENCHMARK( ode_runge_kutta SOURCES KokkosODE_RK.cpp ) + + KOKKOSKERNELS_ADD_BENCHMARK( + ode_bdf_solver SOURCES KokkosODE_BDF.cpp + ) endif() diff --git a/perf_test/ode/KokkosODE_BDF.cpp b/perf_test/ode/KokkosODE_BDF.cpp new file mode 100644 index 0000000000..84a310666f --- /dev/null +++ b/perf_test/ode/KokkosODE_BDF.cpp @@ -0,0 +1,266 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. 
+// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#include "KokkosODE_BDF.hpp" + +#include "KokkosKernels_TestUtils.hpp" +#include "KokkosKernels_perf_test_utilities.hpp" + +#include +#include "Benchmark_Context.hpp" + +namespace { +// Robertson's autocatalytic chemical reaction: +// H. H. Robertson, The solution of a set of reaction rate equations, +// in J. Walsh (Ed.), Numerical Analysis: An Introduction, +// pp. 178–182, Academic Press, London (1966). +// +// Equations: y0' = -0.04*y0 + 1e4*y1*y2 +// y1' = 0.04*y0 - 1e4*y1*y2 - 3e7 * y1**2 +// y2' = 3e7 * y1**2 +struct StiffChemistry { + static constexpr int neqs = 3; + + StiffChemistry() {} + + template + KOKKOS_FUNCTION void evaluate_function(const double /*t*/, + const double /*dt*/, + const vec_type1& y, + const vec_type2& f) const { + f(0) = -0.04 * y(0) + 1.e4 * y(1) * y(2); + f(1) = 0.04 * y(0) - 1.e4 * y(1) * y(2) - 3.e7 * y(1) * y(1); + f(2) = 3.e7 * y(1) * y(1); + } + + template + KOKKOS_FUNCTION void evaluate_jacobian(const double /*t*/, + const double /*dt*/, const vec_type& y, + const mat_type& jac) const { + jac(0, 0) = -0.04; + jac(0, 1) = 1.e4 * y(2); + jac(0, 2) = 1.e4 * y(1); + jac(1, 0) = 0.04; + jac(1, 1) = -1.e4 * y(2) - 3.e7 * 2.0 * y(1); + jac(1, 2) = -1.e4 * y(1); + jac(2, 0) = 0.0; + jac(2, 1) = 3.e7 * 2.0 * y(1); + jac(2, 2) = 0.0; + } +}; + +template +struct BDF_Solve_wrapper { + const ode_type my_ode; + const scalar_type t_start, t_end, dt, max_step; + const vec_type y0, y_new; + const mat_type temp, temp2; + + BDF_Solve_wrapper(const ode_type& my_ode_, const scalar_type& t_start_, + const scalar_type& t_end_, const scalar_type& dt_, + const scalar_type& max_step_, const vec_type& y0_, + const vec_type& y_new_, const mat_type& temp_, + const mat_type& temp2_) + : my_ode(my_ode_), + t_start(t_start_), + t_end(t_end_), + 
dt(dt_), + max_step(max_step_), + y0(y0_), + y_new(y_new_), + temp(temp_), + temp2(temp2_) {} + + KOKKOS_FUNCTION void operator()(const int idx) const { + auto subTemp = Kokkos::subview(temp, Kokkos::ALL(), Kokkos::ALL(), idx); + auto subTemp2 = Kokkos::subview(temp2, Kokkos::ALL(), Kokkos::ALL(), idx); + auto subY0 = Kokkos::subview(y0, Kokkos::ALL(), idx); + auto subYnew = Kokkos::subview(y_new, Kokkos::ALL(), idx); + + KokkosODE::Experimental::BDFSolve(my_ode, t_start, t_end, dt, max_step, + subY0, subYnew, subTemp, subTemp2); + } +}; + +} // namespace + +struct bdf_input_parameters { + int num_odes; + int repeat; + bool verbose; + + bdf_input_parameters(const int num_odes_, const int repeat_, + const bool verbose_) + : num_odes(num_odes_), repeat(repeat_), verbose(verbose_){}; +}; + +template +void run_ode_chem(benchmark::State& state, const bdf_input_parameters& inputs) { + using scalar_type = double; + using KAT = Kokkos::ArithTraits; + using vec_type = Kokkos::View; + using mat_type = Kokkos::View; + + StiffChemistry mySys{}; + + const bool verbose = inputs.verbose; + const int num_odes = inputs.num_odes; + const int neqs = mySys.neqs; + + const scalar_type t_start = KAT::zero(), t_end = 350 * KAT::one(); + scalar_type dt = KAT::zero(); + vec_type y0("initial conditions", neqs, num_odes); + vec_type y_new("solution", neqs, num_odes); + + // Set initial conditions + auto y0_h = Kokkos::create_mirror(y0); + for (int sysIdx = 0; sysIdx < num_odes; ++sysIdx) { + y0_h(0, sysIdx) = KAT::one(); + y0_h(1, sysIdx) = KAT::zero(); + y0_h(2, sysIdx) = KAT::zero(); + } + + mat_type temp("buffer1", neqs, 23 + 2 * neqs + 4, num_odes), + temp2("buffer2", 6, 7, num_odes); + + if (verbose) { + std::cout << "Number of problems solved in parallel: " << num_odes + << std::endl; + } + + Kokkos::RangePolicy policy(0, num_odes); + + Kokkos::Timer time; + time.reset(); + for (auto _ : state) { + (void)_; + + // Set initial conditions for each test iteration + state.PauseTiming(); + 
dt = KAT::zero(); + Kokkos::deep_copy(y0, y0_h); + Kokkos::deep_copy(y_new, KAT::zero()); + Kokkos::deep_copy(temp, KAT::zero()); + Kokkos::deep_copy(temp2, KAT::zero()); + BDF_Solve_wrapper bdf_wrapper(mySys, t_start, t_end, dt, + (t_end - t_start) / 10, y0, y_new, temp, + temp2); + state.ResumeTiming(); + + // Actually run the time integrator + Kokkos::parallel_for(policy, bdf_wrapper); + Kokkos::fence(); + } + double run_time = time.seconds(); + std::cout << "Run time: " << run_time << std::endl; + + Kokkos::deep_copy(y0_h, y0); + double error; + for (int odeIdx = 0; odeIdx < num_odes; ++odeIdx) { + error = 0; + // error += Kokkos::abs(y0_h(0, odeIdx) - 0.4193639) / 0.4193639; + // error += Kokkos::abs(y0_h(1, odeIdx) - 0.000002843646) / 0.000002843646; + // error += Kokkos::abs(y0_h(2, odeIdx) - 0.5806333) / 0.5806333; + error += Kokkos::abs(y0_h(0, odeIdx) - 0.462966) / 0.462966; + error += Kokkos::abs(y0_h(1, odeIdx) - 3.42699e-06) / 3.42699e-06; + error += Kokkos::abs(y0_h(2, odeIdx) - 0.537030) / 0.537030; + error = error / 3; + + if (error > 1e-6) { + std::cout << "Large error in problem " << odeIdx << ": " << error + << std::endl; + } + } +} + +void print_options() { + std::cerr << "Options\n" << std::endl; + + std::cerr << perf_test::list_common_options(); + + std::cerr + << "\t[Optional] --repeat :: how many times to repeat overall test" + << std::endl; + std::cerr << "\t[Optional] --verbose :: enable verbose output" + << std::endl; + std::cerr << "\t[Optional] --n :: number of ode problems to solve" + << std::endl; +} // print_options + +int parse_inputs(bdf_input_parameters& params, int argc, char** argv) { + for (int i = 1; i < argc; ++i) { + if (perf_test::check_arg_int(i, argc, argv, "--n", params.num_odes)) { + ++i; + } else if (perf_test::check_arg_int(i, argc, argv, "--repeat", + params.repeat)) { + ++i; + } else if (perf_test::check_arg_bool(i, argc, argv, "--verbose", + params.verbose)) { + } else { + std::cerr << "Unrecognized command line 
argument #" << i << ": " + << argv[i] << std::endl; + print_options(); + return 1; + } + } + return 0; +} // parse_inputs + +template +void run_benchmark_wrapper(benchmark::State& state, + bdf_input_parameters params) { + run_ode_chem(state, params); +} + +int main(int argc, char** argv) { + Kokkos::initialize(argc, argv); + { + benchmark::Initialize(&argc, argv); + benchmark::SetDefaultTimeUnit(benchmark::kMillisecond); + KokkosKernelsBenchmark::add_benchmark_context(true); + + perf_test::CommonInputParams common_params; + perf_test::parse_common_options(argc, argv, common_params); + + std::string bench_name = "KokkosODE_BDF_Stiff_Chem"; + bdf_input_parameters params(1000, 1, false); + parse_inputs(params, argc, argv); + + if (0 < common_params.repeat) { + benchmark::RegisterBenchmark( + bench_name.c_str(), + run_benchmark_wrapper, params) + ->UseRealTime() + ->ArgNames({"n"}) + ->Args({params.num_odes}) + ->Iterations(common_params.repeat); + } else { + benchmark::RegisterBenchmark( + bench_name.c_str(), + run_benchmark_wrapper, params) + ->UseRealTime() + ->ArgNames({"n"}) + ->Args({params.num_odes}); + } + + benchmark::RunSpecifiedBenchmarks(); + + benchmark::Shutdown(); + } + Kokkos::finalize(); + + return 0; +} diff --git a/perf_test/performance/CMakeLists.txt b/perf_test/performance/CMakeLists.txt index 93d377ba60..601b33256c 100644 --- a/perf_test/performance/CMakeLists.txt +++ b/perf_test/performance/CMakeLists.txt @@ -7,9 +7,9 @@ KOKKOSKERNELS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}) # performance_example is a simple example of using it. 
#don't assert that this is defined anymore -#ASSERT_DEFINED(TPL_ENABLE_yaml-cpp) +#ASSERT_DEFINED(TPL_ENABLE_yamlcpp) -IF(TPL_ENABLE_yaml-cpp) +IF(${PACKAGE_NAME}_ENABLE_yamlcpp) KOKKOSKERNELS_ADD_UNIT_TEST( performance_validate diff --git a/perf_test/sparse/CMakeLists.txt b/perf_test/sparse/CMakeLists.txt index 8a994b4122..ef0bf7d995 100644 --- a/perf_test/sparse/CMakeLists.txt +++ b/perf_test/sparse/CMakeLists.txt @@ -145,4 +145,9 @@ if (KokkosKernels_ENABLE_BENCHMARK) if (Kokkos_CXX_COMPILER_ID STREQUAL HIPCC AND Kokkos_CXX_COMPILER_VERSION VERSION_LESS 5.3) target_link_libraries(KokkosKernels_sparse_spmv_bsr_benchmark PRIVATE -lstdc++fs) endif() + # IntelLLVM <= 2023.1.0 (and possibly higher versions too) has an underlying clang that puts std::filesystem + # in an experimental namespace and a different library + if (Kokkos_CXX_COMPILER_ID STREQUAL IntelLLVM AND Kokkos_CXX_COMPILER_VERSION VERSION_LESS_EQUAL 2023.1.0) + target_link_libraries(KokkosKernels_sparse_spmv_bsr_benchmark PRIVATE -lstdc++fs) + endif() endif() diff --git a/perf_test/sparse/KokkosSparse_kk_spmv.cpp b/perf_test/sparse/KokkosSparse_kk_spmv.cpp index 3f4893363a..194ee9afd4 100644 --- a/perf_test/sparse/KokkosSparse_kk_spmv.cpp +++ b/perf_test/sparse/KokkosSparse_kk_spmv.cpp @@ -28,85 +28,159 @@ #include #include #include +#include // for graph_max_degree #include #include "KokkosKernels_default_types.hpp" -typedef default_scalar Scalar; -typedef default_lno_t Ordinal; -typedef default_size_type Offset; - -template -void run_spmv(Ordinal numRows, Ordinal numCols, const char* filename, int loop, - int num_vecs, char mode, Scalar beta) { - typedef KokkosSparse::CrsMatrix - matrix_type; - typedef typename Kokkos::View mv_type; - typedef typename mv_type::HostMirror h_mv_type; - - srand(17312837); - matrix_type A; - if (filename) - A = KokkosSparse::Impl::read_kokkos_crst_matrix(filename); - else { - Offset nnz = 10 * numRows; - // note: the help text says the bandwidth is fixed at 0.01 * 
numRows - A = KokkosSparse::Impl::kk_generate_sparse_matrix( - numRows, numCols, nnz, 0, 0.01 * numRows); - } - numRows = A.numRows(); - numCols = A.numCols(); - - std::cout << "A is " << numRows << "x" << numCols << ", with " << A.nnz() - << " nonzeros\n"; - std::cout << "SpMV mode " << mode << ", " << num_vecs - << " vectors, beta = " << beta << ", multivectors are "; - std::cout << (std::is_same_v ? "LayoutLeft" - : "LayoutRight"); - std::cout << '\n'; - - mv_type x("X", numCols, num_vecs); - mv_type y("Y", numRows, num_vecs); - h_mv_type h_x = Kokkos::create_mirror_view(x); - h_mv_type h_y = Kokkos::create_mirror_view(y); - h_mv_type h_y_compare = Kokkos::create_mirror(y); - - for (int v = 0; v < num_vecs; v++) { - for (int i = 0; i < numCols; i++) { - h_x(i, v) = (Scalar)(1.0 * (rand() % 40) - 20.); - } - } +using Scalar = default_scalar; +using Ordinal = default_lno_t; +using Offset = default_size_type; +using KAT = Kokkos::ArithTraits; + +struct SPMVBenchmarking { + // note: CLI currently only allows square matrices to be randomly generated + // and nz/row is fixed at 10 + Ordinal num_rows = 110503; + Ordinal num_cols = 110503; + char mode = 'N'; + int loop = 100; + int num_vecs = 1; + Scalar beta = KAT::zero(); + std::string filename = ""; + bool flush_cache = false; + bool non_reuse = false; - Kokkos::deep_copy(x, h_x); + // Using the parameters above, run and time spmv where x and y use the given + // memory layout. 
+ template + void run() { + using matrix_type = + KokkosSparse::CrsMatrix; + using mv_type = Kokkos::View; + using h_mv_type = typename mv_type::HostMirror; - // Benchmark - auto x0 = Kokkos::subview(x, Kokkos::ALL(), 0); - auto y0 = Kokkos::subview(y, Kokkos::ALL(), 0); - // Do 5 warm up calls (not timed) - for (int i = 0; i < 5; i++) { - if (num_vecs == 1) { - // run the rank-1 version - KokkosSparse::spmv(&mode, 1.0, A, x0, beta, y0); + srand(17312837); + matrix_type A; + if (filename != "") { + std::cout << "Reading A from file \"" << filename << "\"...\n"; + A = KokkosSparse::Impl::read_kokkos_crst_matrix( + filename.c_str()); + num_rows = A.numRows(); + num_cols = A.numCols(); } else { - // rank-2 - KokkosSparse::spmv(&mode, 1.0, A, x, beta, y); + std::cout << "Randomly generating A...\n"; + Offset nnz = 10 * num_rows; + // note: the help text says the bandwidth is fixed at 0.01 * numRows + A = KokkosSparse::Impl::kk_generate_sparse_matrix( + num_rows, num_cols, nnz, 0, 0.01 * num_rows); } - Kokkos::DefaultExecutionSpace().fence(); - } - Kokkos::Timer timer; - for (int i = 0; i < loop; i++) { - if (num_vecs == 1) { - // run the rank-1 version - KokkosSparse::spmv(&mode, 1.0, A, x0, beta, y0); - } else { - // rank-2 - KokkosSparse::spmv(&mode, 1.0, A, x, beta, y); + + std::cout << "A is " << A.numRows() << "x" << A.numCols() << ", with " + << A.nnz() << " nonzeros\n"; + std::cout << "Mean nnz/row: " << (double)A.nnz() / A.numRows() << '\n'; + std::cout << "Max nnz/row: " + << KokkosSparse::Impl::graph_max_degree< + Kokkos::DefaultExecutionSpace, Ordinal>(A.graph.row_map) + << '\n'; + std::cout << "SpMV mode " << mode << ", " << num_vecs + << " vectors, beta = " << beta << ", multivectors are "; + std::cout << (std::is_same_v ? "LayoutLeft" + : "LayoutRight"); + std::cout << '\n'; + + bool transpose_like = (mode == 'T') || (mode == 'H'); + + Ordinal xlen = transpose_like ? A.numRows() : A.numCols(); + Ordinal ylen = transpose_like ? 
A.numCols() : A.numRows(); + + mv_type x("X", xlen, num_vecs); + mv_type y("Y", ylen, num_vecs); + + h_mv_type h_x = Kokkos::create_mirror_view(x); + h_mv_type h_y = Kokkos::create_mirror_view(y); + h_mv_type h_y_compare = Kokkos::create_mirror(y); + + for (int v = 0; v < num_vecs; v++) { + for (Ordinal i = 0; i < xlen; i++) { + h_x(i, v) = (Scalar)(1.0 * (rand() % 40) - 20.); + } } - Kokkos::DefaultExecutionSpace().fence(); + + Kokkos::deep_copy(x, h_x); + + // Benchmark + auto x0 = Kokkos::subview(x, Kokkos::ALL(), 0); + auto y0 = Kokkos::subview(y, Kokkos::ALL(), 0); + + // Create handles for both rank-1 and rank-2 cases, + // even though only 1 will get used below (depending on num_vecs) + + KokkosSparse::SPMVHandle + handle_rank1; + KokkosSparse::SPMVHandle + handle_rank2; + // Assuming that 1GB is enough to fully clear the L3 cache of a CPU, or the + // L2 of a GPU. (Some AMD EPYC chips have 768 MB L3) + Kokkos::View cacheFlushData; + if (flush_cache) { + Kokkos::resize(cacheFlushData, 1024 * 1024 * 1024); + } + + Kokkos::DefaultExecutionSpace space; + + // Do 5 warm up calls (not timed). This will also initialize the handle. + for (int i = 0; i < 5; i++) { + if (num_vecs == 1) { + // run the rank-1 version + if (non_reuse) + KokkosSparse::spmv(space, &mode, 1.0, A, x0, beta, y0); + else + KokkosSparse::spmv(space, &handle_rank1, &mode, 1.0, A, x0, beta, y0); + } else { + // rank-2 + if (non_reuse) + KokkosSparse::spmv(space, &mode, 1.0, A, x, beta, y); + else + KokkosSparse::spmv(space, &handle_rank2, &mode, 1.0, A, x, beta, y); + } + space.fence(); + } + + double totalTime = 0; + Kokkos::Timer timer; + for (int i = 0; i < loop; i++) { + if (flush_cache) { + // Copy some non-zero data to the view multiple times to flush the + // cache. 
(nonzero in case the system has an optimized path for zero + // pages) + for (int rep = 0; rep < 4; rep++) + Kokkos::deep_copy(space, cacheFlushData, char(rep + 1)); + } + space.fence(); + timer.reset(); + if (num_vecs == 1) { + // run the rank-1 version + if (non_reuse) + KokkosSparse::spmv(space, &mode, 1.0, A, x0, beta, y0); + else + KokkosSparse::spmv(space, &handle_rank1, &mode, 1.0, A, x0, beta, y0); + } else { + // rank-2 + if (non_reuse) + KokkosSparse::spmv(space, &mode, 1.0, A, x, beta, y); + else + KokkosSparse::spmv(space, &handle_rank2, &mode, 1.0, A, x, beta, y); + } + space.fence(); + totalTime += timer.seconds(); + } + double avg_time = totalTime / loop; + std::cout << avg_time << " s\n"; } - double avg_time = timer.seconds() / loop; - std::cout << avg_time << " s\n"; -} +}; void print_help() { printf(" -s [nrows] : matrix dimension (square)\n"); @@ -117,8 +191,11 @@ void print_help() { " --layout left|right : memory layout of x/y. Default depends on " "build's default execution space\n"); printf( - " -m N|T : matrix apply mode: N (normal, default), T " - "(transpose)\n"); + " -m N|T|H|C : matrix apply mode:\n" + " N - normal, default\n" + " T - transpose\n" + " H - conjugate transpose\n" + " C - conjugate\n"); printf( " -f [file],-fb [file] : Read in Matrix Market (.mtx), or binary " "(.bin) matrix file.\n"); @@ -126,21 +203,21 @@ void print_help() { " -l [LOOP] : How many spmv to run to aggregate average " "time. 
\n"); printf(" -b beta : beta, as in y := Ax + (beta)y\n"); + printf( + " --flush : Flush the cache between each spmv call " + "(slow!)\n"); + printf( + " --non-reuse : Use non-reuse interface (without " + "SPMVHandle)\n"); } int main(int argc, char** argv) { - long long int size = 110503; // a prime number - char* filename = NULL; - - char mode = 'N'; + SPMVBenchmarking sb; char layout; if (std::is_same::value) layout = 'L'; else layout = 'R'; - int loop = 100; - int num_vecs = 1; - Scalar beta = 0.0; if (argc == 1) { print_help(); @@ -149,27 +226,31 @@ int main(int argc, char** argv) { for (int i = 0; i < argc; i++) { if ((strcmp(argv[i], "-s") == 0)) { - size = atoi(argv[++i]); + // only square matrices supported now + sb.num_rows = atoi(argv[++i]); + sb.num_cols = sb.num_rows; continue; } if ((strcmp(argv[i], "-f") == 0 || strcmp(argv[i], "-fb") == 0)) { - filename = argv[++i]; + sb.filename = argv[++i]; continue; } if ((strcmp(argv[i], "-l") == 0)) { - loop = atoi(argv[++i]); + sb.loop = atoi(argv[++i]); continue; } if ((strcmp(argv[i], "-m") == 0)) { - mode = toupper(argv[++i][0]); + sb.mode = toupper(argv[++i][0]); + if (sb.mode != 'N' && sb.mode != 'T' && sb.mode != 'C' && sb.mode != 'H') + throw std::invalid_argument("Mode must be one of N, T, C or H."); continue; } if ((strcmp(argv[i], "--nv") == 0)) { - num_vecs = atoi(argv[++i]); + sb.num_vecs = atoi(argv[++i]); continue; } if ((strcmp(argv[i], "-b") == 0)) { - beta = atof(argv[++i]); + sb.beta = atof(argv[++i]); continue; } if ((strcmp(argv[i], "--layout") == 0)) { @@ -180,6 +261,15 @@ int main(int argc, char** argv) { layout = 'R'; else throw std::runtime_error("Invalid layout"); + continue; + } + if ((strcmp(argv[i], "--flush") == 0)) { + sb.flush_cache = true; + continue; + } + if ((strcmp(argv[i], "--non-reuse") == 0)) { + sb.non_reuse = true; + continue; } if ((strcmp(argv[i], "--help") == 0) || (strcmp(argv[i], "-h") == 0)) { print_help(); @@ -190,11 +280,9 @@ int main(int argc, char** argv) { 
Kokkos::initialize(argc, argv); if (layout == 'L') - run_spmv(size, size, filename, loop, num_vecs, mode, - beta); + sb.template run(); else - run_spmv(size, size, filename, loop, num_vecs, mode, - beta); + sb.template run(); Kokkos::finalize(); } diff --git a/perf_test/sparse/KokkosSparse_spadd.cpp b/perf_test/sparse/KokkosSparse_spadd.cpp index 3b347eb903..a785ea82f6 100644 --- a/perf_test/sparse/KokkosSparse_spadd.cpp +++ b/perf_test/sparse/KokkosSparse_spadd.cpp @@ -303,8 +303,8 @@ void run_experiment(int argc, char** argv, CommonInputParams) { double numericTime = 0; // Do an untimed warm up symbolic, and preallocate space for C entries/values - spadd_symbolic(&kh, A.graph.row_map, A.graph.entries, B.graph.row_map, - B.graph.entries, row_mapC); + spadd_symbolic(exec_space{}, &kh, A.numRows(), A.numCols(), A.graph.row_map, + A.graph.entries, B.graph.row_map, B.graph.entries, row_mapC); bool use_kk = !params.use_cusparse && !params.use_mkl; @@ -366,7 +366,8 @@ void run_experiment(int argc, char** argv, CommonInputParams) { for (int sumRep = 0; sumRep < params.repeat; sumRep++) { timer.reset(); if (use_kk) { - spadd_symbolic(&kh, A.graph.row_map, A.graph.entries, B.graph.row_map, + spadd_symbolic(exec_space{}, &kh, A.numRows(), A.numCols(), + A.graph.row_map, A.graph.entries, B.graph.row_map, B.graph.entries, row_mapC); c_nnz = addHandle->get_c_nnz(); } else if (params.use_cusparse) { @@ -434,7 +435,8 @@ void run_experiment(int argc, char** argv, CommonInputParams) { } #endif } else { - spadd_numeric(&kh, A.graph.row_map, A.graph.entries, A.values, + spadd_numeric(exec_space{}, &kh, A.numRows(), A.numCols(), + A.graph.row_map, A.graph.entries, A.values, 1.0, // A, alpha B.graph.row_map, B.graph.entries, B.values, 1.0, // B, beta diff --git a/perf_test/sparse/KokkosSparse_spgemm_jacobi.cpp b/perf_test/sparse/KokkosSparse_spgemm_jacobi.cpp index 0f705e1209..33cb8a0f5f 100644 --- a/perf_test/sparse/KokkosSparse_spgemm_jacobi.cpp +++ 
b/perf_test/sparse/KokkosSparse_spgemm_jacobi.cpp @@ -237,17 +237,10 @@ int main(int argc, char** argv) { Kokkos::print_configuration(std::cout); #if defined(KOKKOS_ENABLE_OPENMP) - if (params.use_openmp) { -#ifdef KOKKOSKERNELS_INST_MEMSPACE_HBWSPACE - KokkosKernels::Experiment::run_spgemm_jacobi< - size_type, lno_t, scalar_t, Kokkos::OpenMP, - Kokkos::Experimental::HBWSpace, Kokkos::HostSpace>(params); -#else KokkosKernels::Experiment::run_spgemm_jacobi< size_type, lno_t, scalar_t, Kokkos::OpenMP, Kokkos::OpenMP::memory_space, Kokkos::OpenMP::memory_space>(params); -#endif } #endif diff --git a/perf_test/sparse/KokkosSparse_spiluk.cpp b/perf_test/sparse/KokkosSparse_spiluk.cpp index 331ae9ec82..c85b126019 100644 --- a/perf_test/sparse/KokkosSparse_spiluk.cpp +++ b/perf_test/sparse/KokkosSparse_spiluk.cpp @@ -144,12 +144,6 @@ int test_spiluk_perf(std::vector tests, std::string afilename, int kin, // std::cout << "Create handle" << std::endl; switch (test) { - case LVLSCHED_RP: - kh.create_spiluk_handle(SPILUKAlgorithm::SEQLVLSCHD_RP, nrows, - EXPAND_FACT * nnz * (fill_lev + 1), - EXPAND_FACT * nnz * (fill_lev + 1)); - kh.get_spiluk_handle()->print_algorithm(); - break; case LVLSCHED_TP1: kh.create_spiluk_handle(SPILUKAlgorithm::SEQLVLSCHD_TP1, nrows, EXPAND_FACT * nnz * (fill_lev + 1), diff --git a/perf_test/sparse/KokkosSparse_spmv_benchmark.cpp b/perf_test/sparse/KokkosSparse_spmv_benchmark.cpp index aeaa37db96..6adf55b26e 100644 --- a/perf_test/sparse/KokkosSparse_spmv_benchmark.cpp +++ b/perf_test/sparse/KokkosSparse_spmv_benchmark.cpp @@ -35,13 +35,20 @@ namespace { struct spmv_parameters { - int N, offset; + int N, offset, numvecs; + std::string mode; std::string filename; std::string alg; std::string tpl; spmv_parameters(const int N_) - : N(N_), offset(0), filename(""), alg(""), tpl("") {} + : N(N_), + offset(0), + numvecs(1), + mode(""), + filename(""), + alg(""), + tpl("") {} }; void print_options() { @@ -49,9 +56,11 @@ void print_options() { std::cerr << 
perf_test::list_common_options(); - std::cerr - << "\t[Optional] --repeat :: how many times to repeat overall test" - << std::endl; + std::cerr << "\t[Optional] --mode :: whether to run a suite of " + << "automated tests or manually define one (auto, manual)" + << std::endl; + std::cerr << "\t[Optional] --repeat :: how many times to repeat overall " + << "test" << std::endl; std::cerr << " -n [N] :: generate a semi-random banded (band size " "0.01xN)\n" "NxN matrix with average of 10 entries per row." @@ -59,25 +68,30 @@ void print_options() { std::cerr << "\t[Optional] --alg :: the algorithm to run (default, " "native, merge)" << std::endl; - std::cerr - << "\t[Optional] --alg :: the algorithm to run (classic, merge)" - << std::endl; std::cerr << "\t[Optional] --TPL :: when available and compatible with " "alg, a TPL can be used (cusparse, rocsparse, MKL)" << std::endl; - std::cerr - << " -f [file] : Read in Matrix Market formatted text file 'file'." - << std::endl; + std::cerr << " -f [file] : Read in Matrix Market formatted text file" + << " 'file'." << std::endl; std::cerr << " --offset [O] : Subtract O from every index.\n" << " Useful in case the matrix market file is " "not 0 based." 
<< std::endl; + std::cerr << " --num_vecs : The number of vectors stored in X and Y" + << std::endl; } // print_options void parse_inputs(int argc, char** argv, spmv_parameters& params) { for (int i = 1; i < argc; ++i) { if (perf_test::check_arg_int(i, argc, argv, "-n", params.N)) { ++i; + } else if (perf_test::check_arg_str(i, argc, argv, "--mode", params.mode)) { + if ((params.mode != "") && (params.mode != "auto") && + (params.mode != "manual")) { + throw std::runtime_error( + "--mode can only be an empty string, `auto` or `manual`!"); + } + ++i; } else if (perf_test::check_arg_str(i, argc, argv, "--alg", params.alg)) { if ((params.alg != "") && (params.alg != "default") && (params.alg != "native") && (params.alg != "merge")) { @@ -93,6 +107,9 @@ void parse_inputs(int argc, char** argv, spmv_parameters& params) { } else if (perf_test::check_arg_int(i, argc, argv, "--offset", params.offset)) { ++i; + } else if (perf_test::check_arg_int(i, argc, argv, "--num_vecs", + params.numvecs)) { + ++i; } else { print_options(); KK_USER_REQUIRE_MSG(false, "Unrecognized command line argument #" @@ -105,13 +122,21 @@ template void run_spmv(benchmark::State& state, const spmv_parameters& inputs) { using matrix_type = KokkosSparse::CrsMatrix; - using mv_type = Kokkos::View; - - KokkosKernels::Experimental::Controls controls; - if ((inputs.alg == "default") || (inputs.alg == "native") || - (inputs.alg == "merge")) { - controls.setParameter("algorithm", inputs.alg); + using mv_type = Kokkos::View; + using handle_t = + KokkosSparse::SPMVHandle; + + KokkosSparse::SPMVAlgorithm spmv_alg; + if ((inputs.alg == "default") || (inputs.alg == "")) { + spmv_alg = KokkosSparse::SPMVAlgorithm::SPMV_DEFAULT; + } else if (inputs.alg == "native") { + spmv_alg = KokkosSparse::SPMVAlgorithm::SPMV_NATIVE; + } else if (inputs.alg == "merge") { + spmv_alg = KokkosSparse::SPMVAlgorithm::SPMV_MERGE_PATH; + } else { + throw std::runtime_error("invalid spmv algorithm"); } + handle_t handle(spmv_alg); // 
Create test matrix srand(17312837); @@ -126,16 +151,17 @@ void run_spmv(benchmark::State& state, const spmv_parameters& inputs) { } // Create input vectors - mv_type x("X", A.numRows()); - mv_type y("Y", A.numCols()); + mv_type x("X", A.numRows(), inputs.numvecs); + mv_type y("Y", A.numCols(), inputs.numvecs); Kokkos::Random_XorShift64_Pool rand_pool(13718); Kokkos::fill_random(x, rand_pool, 10); Kokkos::fill_random(y, rand_pool, 10); + Kokkos::fence(); // Run the actual experiments for (auto _ : state) { - KokkosSparse::spmv(controls, KokkosSparse::NoTranspose, 1.0, A, x, 0.0, y); + KokkosSparse::spmv(&handle, KokkosSparse::NoTranspose, 1.0, A, x, 0.0, y); Kokkos::fence(); } } @@ -158,12 +184,25 @@ int main(int argc, char** argv) { spmv_parameters inputs(100000); parse_inputs(argc, argv, inputs); - // Google benchmark will report the wrong n if an input file matrix is used. - KokkosKernelsBenchmark::register_benchmark_real_time( - bench_name.c_str(), run_spmv, {"n"}, - {inputs.N}, common_params.repeat, inputs); - benchmark::RunSpecifiedBenchmarks(); + if ((inputs.mode == "") || (inputs.mode == "auto")) { + for (int n : {10000, 20000, 40000, 100000, 250000, 1000000}) { + for (int nv : {1, 2, 3, 4, 10}) { + inputs.N = n; + inputs.numvecs = nv; + KokkosKernelsBenchmark::register_benchmark_real_time( + bench_name.c_str(), run_spmv, + {"n", "nv"}, {inputs.N, inputs.numvecs}, common_params.repeat, + inputs); + } + } + } else { + // Google benchmark will report the wrong n if an input file matrix is used. 
+ KokkosKernelsBenchmark::register_benchmark_real_time( + bench_name.c_str(), run_spmv, {"n"}, + {inputs.N}, common_params.repeat, inputs); + } + benchmark::RunSpecifiedBenchmarks(); benchmark::Shutdown(); Kokkos::finalize(); diff --git a/perf_test/sparse/KokkosSparse_spmv_bsr.cpp b/perf_test/sparse/KokkosSparse_spmv_bsr.cpp index d3b038f0e4..d96a3c6c8d 100644 --- a/perf_test/sparse/KokkosSparse_spmv_bsr.cpp +++ b/perf_test/sparse/KokkosSparse_spmv_bsr.cpp @@ -159,18 +159,22 @@ int test_bsr_matrix_single_vec( y_vector_type ycrs("crs_product_result", nRow); auto h_ycrs = Kokkos::create_mirror_view(ycrs); - KokkosKernels::Experimental::Controls controls; + KokkosSparse::SPMVAlgorithm algo = KokkosSparse::SPMV_DEFAULT; + switch (static_cast(test)) { case Implementation::KokkosKernels: { - controls.setParameter("algorithm", "native"); + algo = KokkosSparse::SPMV_NATIVE; } break; default: break; } + KokkosSparse::SPMVHandle + handle_crs(algo); // Do the multiplication for warming up for (Ordinal ir = 0; ir < nRow; ++ir) h_ycrs(ir) = h_y0(ir); Kokkos::deep_copy(ycrs, h_ycrs); - KokkosSparse::spmv(controls, fOp, alpha, Acrs, xref, beta, ycrs); + KokkosSparse::spmv(&handle_crs, fOp, alpha, Acrs, xref, beta, ycrs); // Time a series of multiplications with the CrsMatrix double time_crs = 0.0; @@ -178,7 +182,7 @@ int test_bsr_matrix_single_vec( for (Ordinal ir = 0; ir < nRow; ++ir) h_ycrs(ir) = h_y0(ir); Kokkos::deep_copy(ycrs, h_ycrs); Kokkos::Timer timer; - KokkosSparse::spmv(controls, fOp, alpha, Acrs, xref, beta, ycrs); + KokkosSparse::spmv(&handle_crs, fOp, alpha, Acrs, xref, beta, ycrs); time_crs += timer.seconds(); Kokkos::fence(); } @@ -192,10 +196,14 @@ int test_bsr_matrix_single_vec( scalar_t, Ordinal, Kokkos::DefaultExecutionSpace, void, int> Absr(Acrs, blockSize); + KokkosSparse::SPMVHandle + handle_bsr(algo); + // Do the multiplication for warming up for (Ordinal ir = 0; ir < nRow; ++ir) h_ybsr(ir) = h_y0(ir); Kokkos::deep_copy(ybsr, h_ybsr); - 
KokkosSparse::spmv(controls, fOp, alpha, Absr, xref, beta, ybsr); + KokkosSparse::spmv(&handle_bsr, fOp, alpha, Absr, xref, beta, ybsr); // Time a series of multiplications with the BsrMatrix double time_bsr = 0.0; @@ -203,7 +211,7 @@ int test_bsr_matrix_single_vec( for (Ordinal ir = 0; ir < nRow; ++ir) h_ybsr(ir) = h_y0(ir); Kokkos::deep_copy(ybsr, h_ybsr); Kokkos::Timer timer; - KokkosSparse::spmv(controls, fOp, alpha, Absr, xref, beta, ybsr); + KokkosSparse::spmv(&handle_bsr, fOp, alpha, Absr, xref, beta, ybsr); time_bsr += timer.seconds(); Kokkos::fence(); } @@ -316,19 +324,23 @@ int test_bsr_matrix_vec( block_vector_t ycrs("crs_product_result", nRow, nvec); auto h_ycrs = Kokkos::create_mirror_view(ycrs); - KokkosKernels::Experimental::Controls controls; + KokkosSparse::SPMVAlgorithm algo = KokkosSparse::SPMV_DEFAULT; + switch (static_cast(test)) { case Implementation::KokkosKernels: { - controls.setParameter("algorithm", "native"); + algo = KokkosSparse::SPMV_NATIVE; } break; default: break; } + KokkosSparse::SPMVHandle + handle_crs(algo); // Do the multiplication for warming up for (Ordinal jc = 0; jc < nvec; ++jc) for (Ordinal ir = 0; ir < nRow; ++ir) h_ycrs(ir, jc) = h_y0(ir, jc); Kokkos::deep_copy(ycrs, h_ycrs); - KokkosSparse::spmv(controls, fOp, alpha, Acrs, xref, beta, ycrs); + KokkosSparse::spmv(&handle_crs, fOp, alpha, Acrs, xref, beta, ycrs); // Time a series of multiplications with the CrsMatrix format double time_crs = 0.0; @@ -337,7 +349,7 @@ int test_bsr_matrix_vec( for (Ordinal ir = 0; ir < nRow; ++ir) h_ycrs(ir, jc) = h_y0(ir, jc); Kokkos::deep_copy(ycrs, h_ycrs); Kokkos::Timer timer; - KokkosSparse::spmv(controls, fOp, alpha, Acrs, xref, beta, ycrs); + KokkosSparse::spmv(&handle_crs, fOp, alpha, Acrs, xref, beta, ycrs); time_crs += timer.seconds(); Kokkos::fence(); } @@ -347,6 +359,10 @@ int test_bsr_matrix_vec( scalar_t, Ordinal, Kokkos::DefaultExecutionSpace, void, int> Absr(Acrs, blockSize); + KokkosSparse::SPMVHandle + handle_bsr(algo); + 
block_vector_t ybsr("bsr_product_result", nRow, nvec); auto h_ybsr = Kokkos::create_mirror_view(ybsr); @@ -354,7 +370,7 @@ int test_bsr_matrix_vec( for (Ordinal jc = 0; jc < nvec; ++jc) for (Ordinal ir = 0; ir < nRow; ++ir) h_ybsr(ir, jc) = h_y0(ir, jc); Kokkos::deep_copy(ybsr, h_ybsr); - KokkosSparse::spmv(controls, fOp, alpha, Absr, xref, beta, ybsr); + KokkosSparse::spmv(&handle_bsr, fOp, alpha, Absr, xref, beta, ybsr); // Time a series of multiplications with the BsrMatrix double time_bsr = 0.0; @@ -363,7 +379,7 @@ int test_bsr_matrix_vec( for (Ordinal ir = 0; ir < nRow; ++ir) h_ybsr(ir, jc) = h_y0(ir, jc); Kokkos::deep_copy(ybsr, h_ybsr); Kokkos::Timer timer; - KokkosSparse::spmv(controls, fOp, alpha, Absr, xref, beta, ybsr); + KokkosSparse::spmv(&handle_bsr, fOp, alpha, Absr, xref, beta, ybsr); time_bsr += timer.seconds(); Kokkos::fence(); } diff --git a/perf_test/sparse/KokkosSparse_spmv_bsr_benchmark.cpp b/perf_test/sparse/KokkosSparse_spmv_bsr_benchmark.cpp index 770b09cfb1..254a35c34f 100644 --- a/perf_test/sparse/KokkosSparse_spmv_bsr_benchmark.cpp +++ b/perf_test/sparse/KokkosSparse_spmv_bsr_benchmark.cpp @@ -207,9 +207,10 @@ struct SpmvNative { typename YView> static void spmv(const char *mode, const Alpha &alpha, const Matrix &crs, const XView &x, const Beta &beta, const YView &y) { - KokkosKernels::Experimental::Controls controls; - controls.setParameter("algorithm", "native"); - return KokkosSparse::spmv(controls, mode, alpha, crs, x, beta, y); + KokkosSparse::SPMVHandle + handle(KokkosSparse::SPMV_NATIVE); + return KokkosSparse::spmv(&handle, mode, alpha, crs, x, beta, y); } static std::string name() { return "native"; } @@ -221,9 +222,10 @@ struct SpmvV41 { typename YView> static void spmv(const char *mode, const Alpha &alpha, const Matrix &crs, const XView &x, const Beta &beta, const YView &y) { - KokkosKernels::Experimental::Controls controls; - controls.setParameter("algorithm", "v4.1"); - return KokkosSparse::spmv(controls, mode, alpha, crs, 
x, beta, y); + KokkosSparse::SPMVHandle + handle(KokkosSparse::SPMV_BSR_V41); + return KokkosSparse::spmv(&handle, mode, alpha, crs, x, beta, y); } static std::string name() { return "v4.1"; } @@ -473,4 +475,4 @@ int main(int argc, char **argv) { drop_cache(); Kokkos::finalize(); return 0; -} \ No newline at end of file +} diff --git a/perf_test/sparse/KokkosSparse_spmv_merge.cpp b/perf_test/sparse/KokkosSparse_spmv_merge.cpp index 6ad772116e..fdd2905b52 100644 --- a/perf_test/sparse/KokkosSparse_spmv_merge.cpp +++ b/perf_test/sparse/KokkosSparse_spmv_merge.cpp @@ -148,9 +148,8 @@ matrix_type generate_unbalanced_matrix( void print_help() { printf("SPMV merge benchmark code written by Luc Berger-Vergiat.\n"); - printf( - "The goal is to test cusSPARSE's merge algorithm on imbalanced " - "matrices."); + printf("The goal is to compare the merge path algorithm vs.\n"); + printf("TPLs and the KK native algorithm on imbalanced matrices.\n"); printf("Options:\n"); printf( " --compare : Compare the performance of the merge algo with the " @@ -233,35 +232,59 @@ int main(int argc, char** argv) { Kokkos::initialize(argc, argv); { - if (std::is_same::value) { - // Note that we template the matrix with entries=lno_t and offsets=lno_t - // to make sure it verifies the cusparse requirements - using matrix_type = - KokkosSparse::CrsMatrix; - using values_type = typename matrix_type::values_type::non_const_type; - const Scalar SC_ONE = Kokkos::ArithTraits::one(); - const Scalar alpha = SC_ONE + SC_ONE; - const Scalar beta = alpha + SC_ONE; - - matrix_type test_matrix = generate_unbalanced_matrix( - numRows, numEntries, numLongRows, numLongEntries); - - values_type y("right hand side", test_matrix.numRows()); - values_type x("left hand side", test_matrix.numCols()); - Kokkos::deep_copy(x, SC_ONE); - Kokkos::deep_copy(y, SC_ONE); - - KokkosKernels::Experimental::Controls controls; - controls.setParameter("algorithm", "merge"); - - // Perform a so called "warm-up" run - 
KokkosSparse::spmv(controls, "N", alpha, test_matrix, x, beta, y); - - double min_time = 1.0e32, max_time = 0.0, avg_time = 0.0; + // Note that we template the matrix with entries=lno_t and offsets=lno_t + // so that TPLs can be used + using matrix_type = + KokkosSparse::CrsMatrix; + using values_type = typename matrix_type::values_type::non_const_type; + using handle_type = + KokkosSparse::SPMVHandle; + const Scalar SC_ONE = Kokkos::ArithTraits::one(); + const Scalar alpha = SC_ONE + SC_ONE; + const Scalar beta = alpha + SC_ONE; + + matrix_type test_matrix = generate_unbalanced_matrix( + numRows, numEntries, numLongRows, numLongEntries); + + values_type y("right hand side", test_matrix.numRows()); + values_type x("left hand side", test_matrix.numCols()); + Kokkos::deep_copy(x, SC_ONE); + Kokkos::deep_copy(y, SC_ONE); + + handle_type handleMerge(KokkosSparse::SPMV_MERGE_PATH); + + // Perform a so called "warm-up" run + KokkosSparse::spmv(&handleMerge, "N", alpha, test_matrix, x, beta, y); + + double min_time = 1.0e32, max_time = 0.0, avg_time = 0.0; + for (int iterIdx = 0; iterIdx < loop; ++iterIdx) { + Kokkos::Timer timer; + KokkosSparse::spmv(&handleMerge, "N", alpha, test_matrix, x, beta, y); + Kokkos::fence(); + double time = timer.seconds(); + avg_time += time; + if (time > max_time) max_time = time; + if (time < min_time) min_time = time; + } + + std::cout << "KK Merge alg --- min: " << min_time << " max: " << max_time + << " avg: " << avg_time / loop << std::endl; + + // Run the cusparse default algorithm and native kokkos-kernels algorithm + // then output timings for comparison + if (compare) { + handle_type handleDefault; + // Warm up + KokkosSparse::spmv(&handleDefault, "N", alpha, test_matrix, x, beta, y); + + min_time = 1.0e32; + max_time = 0.0; + avg_time = 0.0; for (int iterIdx = 0; iterIdx < loop; ++iterIdx) { Kokkos::Timer timer; - KokkosSparse::spmv(controls, "N", alpha, test_matrix, x, beta, y); + KokkosSparse::spmv(&handleDefault, "N", alpha, 
test_matrix, x, beta, y); Kokkos::fence(); double time = timer.seconds(); avg_time += time; @@ -269,58 +292,28 @@ int main(int argc, char** argv) { if (time < min_time) min_time = time; } - std::cout << "cuSPARSE Merge alg --- min: " << min_time + std::cout << "Default alg --- min: " << min_time << " max: " << max_time << " avg: " << avg_time / loop << std::endl; - // Run the cusparse default algorithm and native kokkos-kernels algorithm - // then output timings for comparison - if (compare) { - controls.setParameter("algorithm", "default"); - - min_time = 1.0e32; - max_time = 0.0; - avg_time = 0.0; - for (int iterIdx = 0; iterIdx < loop; ++iterIdx) { - Kokkos::Timer timer; - KokkosSparse::spmv(controls, "N", alpha, test_matrix, x, beta, y); - Kokkos::fence(); - double time = timer.seconds(); - avg_time += time; - if (time > max_time) max_time = time; - if (time < min_time) min_time = time; - } - - std::cout << "cuSPARSE Default alg --- min: " << min_time - << " max: " << max_time << " avg: " << avg_time / loop - << std::endl; - - controls.setParameter("algorithm", "native"); - - min_time = 1.0e32; - max_time = 0.0; - avg_time = 0.0; - for (int iterIdx = 0; iterIdx < loop; ++iterIdx) { - Kokkos::Timer timer; - // KokkosSparse::spmv(controls, "N", alpha, test_matrix, x, beta, y); - KokkosSparse::Impl::spmv_beta(Kokkos::DefaultExecutionSpace{}, - controls, "N", alpha, test_matrix, x, - beta, y); - Kokkos::fence(); - double time = timer.seconds(); - avg_time += time; - if (time > max_time) max_time = time; - if (time < min_time) min_time = time; - } - - std::cout << "Kokkos Native alg --- min: " << min_time - << " max: " << max_time << " avg: " << avg_time / loop - << std::endl; + handle_type handleNative(KokkosSparse::SPMV_NATIVE); + KokkosSparse::spmv(&handleNative, "N", alpha, test_matrix, x, beta, y); + + min_time = 1.0e32; + max_time = 0.0; + avg_time = 0.0; + for (int iterIdx = 0; iterIdx < loop; ++iterIdx) { + Kokkos::Timer timer; + 
KokkosSparse::spmv(&handleNative, "N", alpha, test_matrix, x, beta, y); + Kokkos::fence(); + double time = timer.seconds(); + avg_time += time; + if (time > max_time) max_time = time; + if (time < min_time) min_time = time; } - } else { - std::cout << "The default execution space is not Cuda, nothing to do!" + + std::cout << "KK Native alg --- min: " << min_time + << " max: " << max_time << " avg: " << avg_time / loop << std::endl; } } diff --git a/perf_test/sparse/KokkosSparse_spmv_struct_tuning.cpp b/perf_test/sparse/KokkosSparse_spmv_struct_tuning.cpp index 02fcd1640a..85aab62122 100644 --- a/perf_test/sparse/KokkosSparse_spmv_struct_tuning.cpp +++ b/perf_test/sparse/KokkosSparse_spmv_struct_tuning.cpp @@ -581,7 +581,9 @@ int main(int argc, char** argv) { const double alpha = 1.0, beta = 1.0; size_t bufferSize = 0; void* dBuffer = NULL; -#if CUSPARSE_VERSION >= 11201 + +// CUSPARSE_MV_ALG_DEFAULT was deprecated in CUDA 11.2.1 a.k.a cuSPARSE 11.4.0 +#if CUSPARSE_VERSION >= 11400 cusparseSpMVAlg_t alg = CUSPARSE_SPMV_ALG_DEFAULT; #else cusparseSpMVAlg_t alg = CUSPARSE_MV_ALG_DEFAULT; diff --git a/scripts/cm_test_all_sandia b/scripts/cm_test_all_sandia index 28ef93b004..eb296091af 100755 --- a/scripts/cm_test_all_sandia +++ b/scripts/cm_test_all_sandia @@ -91,7 +91,10 @@ print_help() { echo "--with-tpls=TPLS: set KOKKOSKERNELS_ENABLE_TPLS" echo " Provide a comma-separated list of TPLs" echo " Valid items:" - echo " blas, mkl, cublas, cusparse, magma, armpl, rocblas, rocsparse" + echo " blas, mkl, cublas, cusparse, cusolver, magma, armpl, rocblas, rocsparse, rocsolver" + echo "" + echo "--cmake-flags=[CMAKE Command options]: Set Kokkos Kernels cmake options not handled by script" + echo "--kokkos-cmake-flags=[CMAKE Command options]: Set Kokkos cmake options not handled by script" echo "" echo "ARGS: list of expressions matching compilers to test" @@ -145,14 +148,8 @@ if [[ "$HOSTNAME" =~ weaver.* ]]; then module load git fi -if [[ "$HOSTNAME" =~ .*voltrino.* ]]; 
then - MACHINE=voltrino - module load git -fi - if [[ "$HOSTNAME" == *blake* ]]; then # Warning: very generic name MACHINE=blake - module load git fi if [[ "$HOSTNAME" == *solo* ]]; then # Warning: very generic name @@ -163,15 +160,6 @@ if [[ "$HOSTNAME" == kokkos-dev-2* ]]; then MACHINE=kokkos-dev-2 fi -if [[ "$HOSTNAME" == may* ]]; then - MACHINE=mayer -# module load git -fi - -if [[ "$HOSTNAME" == cn* ]]; then # Warning: very generic name - MACHINE=mayer -fi - if [[ "$HOSTNAME" == caraway* ]]; then # Warning: very generic name MACHINE=caraway fi @@ -210,7 +198,6 @@ fi echo "Running on machine: $MACHINE" GCC_BUILD_LIST="OpenMP,Threads,Serial,OpenMP_Serial,Threads_Serial" -IBM_BUILD_LIST="OpenMP,Serial,OpenMP_Serial" ARM_GCC_BUILD_LIST="OpenMP,Serial,OpenMP_Serial" INTEL_BUILD_LIST="OpenMP,Threads,Serial,OpenMP_Serial,Threads_Serial" CLANG_BUILD_LIST="Threads,Serial,Threads_Serial" @@ -218,7 +205,6 @@ CUDA_BUILD_LIST="Cuda_OpenMP,Cuda_Threads,Cuda_Serial" CUDA_IBM_BUILD_LIST="Cuda_OpenMP,Cuda_Serial" GCC_WARNING_FLAGS="-Wall,-Wunused-parameter,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wignored-qualifiers,-Wempty-body,-Wclobbered,-Wuninitialized" -IBM_WARNING_FLAGS="-Wall,-Wunused-parameter,-Wshadow,-pedantic,-Wsign-compare,-Wtype-limits,-Wuninitialized" CLANG_WARNING_FLAGS="-Wall,-Wunused-parameter,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wuninitialized" INTEL_WARNING_FLAGS="-Wall,-Wunused-parameter,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wuninitialized,-diag-disable=1011,-diag-disable=869" CUDA_WARNING_FLAGS="-Wall,-Wunused-parameter,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wuninitialized" @@ -418,6 +404,12 @@ do --with-tpls*) KOKKOSKERNELS_ENABLE_TPLS="${key#*=}" ;; + --cmake-flags*) + PASSTHRU_CMAKE_FLAGS="${key#*=}" + ;; + --kokkos-cmake-flags*) + KOKKOS_PASSTHRU_CMAKE_FLAGS="${key#*=}" + ;; --help*) PRINT_HELP=True ;; @@ -636,45 +628,14 @@ elif [ "$MACHINE" = "weaver" ]; then 
SPACK_HOST_ARCH="+power9" SPACK_CUDA_ARCH="+volta70" -elif [ "$MACHINE" = "voltrino" ]; then - SKIP_HWLOC=True - export SLURM_TASKS_PER_NODE=32 - - BASE_MODULE_LIST="PrgEnv-intel,craype-mic-knl,cmake/3.16.2,slurm/20.11.4a,/,gcc/9.3.0" - - # Format: (compiler module-list build-list exe-name warning-flag) - COMPILERS=("intel/19.0.4 $BASE_MODULE_LIST $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS" - ) - - if [ -z "$ARCH_FLAG" ]; then - ARCH_FLAG="--arch=KNL" - fi -elif [ "$MACHINE" = "mayer" ]; then - SKIP_HWLOC=True - export SLURM_TASKS_PER_NODE=96 - - BASE_MODULE_LIST="cmake/3.17.1,/" - - ARMCLANG_WARNING_FLAGS="-Wall,-Wshadow,-pedantic,-Wsign-compare,-Wtype-limits,-Wuninitialized" - - # Format: (compiler module-list build-list exe-name warning-flag) - COMPILERS=("gnu9/9.3.0 $BASE_MODULE_LIST $ARM_GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "arm/20.1 $BASE_MODULE_LIST $ARM_GCC_BUILD_LIST armclang++ $ARMCLANG_WARNING_FLAGS" - ) - - if [ -z "$ARCH_FLAG" ]; then - ARCH_FLAG="--arch=ARMV8_THUNDERX2" - fi - - SPACK_HOST_ARCH="+armv8_tx2" elif [ "$MACHINE" = "caraway" ]; then SKIP_HWLOC=True # BUILD_ONLY=True # report_and_log_test_result: only testing compilation of code for now, # output description and success based only on build succes; build time output (no run-time) - BASE_MODULE_LIST="cmake/3.19.3,/" - ROCM520_MODULE_LIST="$BASE_MODULE_LIST,openblas/0.3.20/rocm/5.2.0" + BASE_MODULE_LIST="cmake,/" + ROCM520_MODULE_LIST="$BASE_MODULE_LIST,openblas/0.3.20" HIPCLANG_BUILD_LIST="Hip_Serial" HIPCLANG_WARNING_FLAGS="" @@ -686,10 +647,7 @@ elif [ "$MACHINE" = "caraway" ]; then else # Format: (compiler module-list build-list exe-name warning-flag) COMPILERS=("rocm/5.2.0 $BASE_MODULE_LIST $HIPCLANG_BUILD_LIST hipcc $HIPCLANG_WARNING_FLAGS" - "gcc/8.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/9.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/10.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/11.2.0 $BASE_MODULE_LIST 
$GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" + "gcc/11.3.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" ) fi @@ -705,23 +663,23 @@ elif [ "$MACHINE" = "vega90a_caraway" ]; then # output description and success based only on build succes; build time output (no run-time) BASE_MODULE_LIST="cmake,/" - ROCM520_MODULE_LIST="$BASE_MODULE_LIST,openblas/0.3.20/rocm/5.2.0" + ROCM520_MODULE_LIST="$BASE_MODULE_LIST,openblas/0.3.20" + ROCM_TPL_MODULE_LIST="$BASE_MODULE_LIST,openblas/0.3.23" HIPCLANG_BUILD_LIST="Hip_Serial" HIPCLANG_WARNING_FLAGS="" if [ "$SPOT_CHECK_TPLS" = "True" ]; then # Format: (compiler module-list build-list exe-name warning-flag) - COMPILERS=("rocm/5.6.0 $ROCM520_MODULE_LIST $HIPCLANG_BUILD_LIST hipcc $HIPCLANG_WARNING_FLAGS" + COMPILERS=("rocm/5.6.1 $ROCM_TPL_MODULE_LIST $HIPCLANG_BUILD_LIST hipcc $HIPCLANG_WARNING_FLAGS" + "rocm/6.0.0 $ROCM_TPL_MODULE_LIST $HIPCLANG_BUILD_LIST hipcc $HIPCLANG_WARNING_FLAGS" ) else # Format: (compiler module-list build-list exe-name warning-flag) - COMPILERS=("rocm/5.6.0 $BASE_MODULE_LIST $HIPCLANG_BUILD_LIST hipcc $HIPCLANG_WARNING_FLAGS" + COMPILERS=("rocm/5.2.0 $BASE_MODULE_LIST $HIPCLANG_BUILD_LIST hipcc $HIPCLANG_WARNING_FLAGS" "rocm/5.6.1 $BASE_MODULE_LIST $HIPCLANG_BUILD_LIST hipcc $HIPCLANG_WARNING_FLAGS" - "gcc/8.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/9.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/10.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/11.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" + "rocm/6.0.0 $BASE_MODULE_LIST $HIPCLANG_BUILD_LIST hipcc $HIPCLANG_WARNING_FLAGS" + "gcc/11.3.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" ) fi @@ -731,53 +689,56 @@ elif [ "$MACHINE" = "vega90a_caraway" ]; then ARCH_FLAG="--arch=VEGA90A" fi elif [ "$MACHINE" = "blake" ]; then + MODULE_ENVIRONMENT="source /projects/x86-64-icelake-rocky8/spack-config/blake-setup-user-module-env.sh" eval 
"$MODULE_ENVIRONMENT" SKIP_HWLOC=True - export SLURM_TASKS_PER_NODE=32 - - module load cmake/3.19.3 - BASE_MODULE_LIST="cmake/3.19.3,/" - BASE_MODULE_LIST_INTEL="cmake/3.19.3,/compilers/" - BASE_MODULE_LIST_ONEAPI="cmake/3.19.3,/oneAPI/base-toolkit/,/oneAPI/hpc-toolkit/" - ONEAPI_WARNING_FLAGS="" + module load cmake - GCC102_MODULE_TPL_LIST="$BASE_MODULE_LIST,openblas/0.3.21/gcc/10.2.0" + BASE_MODULE_LIST="cmake,/" + BASE_MODULE_LIST_TPLS="cmake,/,openblas/0.3.23" + BASE_MODULE_LIST_ONEAPI_202310="cmake,-oneapi-compilers/,intel-oneapi-dpl/2022.1.0,intel-oneapi-mkl/2023.1.0,intel-oneapi-tbb/2021.9.0" + BASE_MODULE_LIST_ONEAPI_202320="cmake,-oneapi-compilers/,intel-oneapi-dpl/2022.2.0,intel-oneapi-mkl/2023.2.0,intel-oneapi-tbb/2021.10.0" + ONEAPI_FLAGS_EXTRA="-fp-model=precise" + LLVM_EXTRA_FLAGS="-fPIC ${CLANG_WARNING_FLAGS}" + # Remove -Wuninitialized: compiler issues show up with Threads backend + GCC11_WARNING_FLAGS="-Wall,-Wunused-parameter,-Wshadow,-pedantic,-Werror,-Wsign-compare,-Wtype-limits,-Wignored-qualifiers,-Wempty-body,-Wclobbered" + # update KOKKOS_PASSTHRU_CMAKE_FLAGS to disable onedpl on Blake + KOKKOS_PASSTHRU_CMAKE_FLAGS="${KOKKOS_PASSTHRU_CMAKE_FLAGS} -DKokkos_ENABLE_ONEDPL=OFF" if [ "$SPOT_CHECK" = "True" ]; then # Format: (compiler module-list build-list exe-name warning-flag) - # TODO: Failing toolchains: - #"intel/18.1.163 $BASE_MODULE_LIST_INTEL $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS" - #"pgi/18.7.0 $BASE_MODULE_LIST $GCC_BUILD_LIST pgc++ $PGI_WARNING_FLAGS" - COMPILERS=("clang/10.0.1 $BASE_MODULE_LIST "Threads_Serial" clang++ $CLANG_WARNING_FLAGS" - "intel/19.5.281 $BASE_MODULE_LIST_INTEL "OpenMP,Threads" icpc $INTEL_WARNING_FLAGS" - "gcc/10.2.0 $BASE_MODULE_LIST "Threads_Serial,OpenMP" g++ $GCC_WARNING_FLAGS" - "gcc/11.2.0 $BASE_MODULE_LIST "Threads_Serial,OpenMP" g++ $GCC_WARNING_FLAGS" + COMPILERS=("intel/2023.1.0 $BASE_MODULE_LIST_ONEAPI_202310 "OpenMP,Threads,Serial" icpx $ONEAPI_FLAGS_EXTRA" + "intel/2023.2.0 
$BASE_MODULE_LIST_ONEAPI_202320 "OpenMP,Threads,Serial" icpx $ONEAPI_FLAGS_EXTRA" + "llvm/15.0.7 $BASE_MODULE_LIST "Threads,Serial" clang++ $LLVM_EXTRA_FLAGS" + "gcc/11.3.0 $BASE_MODULE_LIST "OpenMP,Threads,Serial" g++ $GCC11_WARNING_FLAGS" + "gcc/12.2.0 $BASE_MODULE_LIST "OpenMP,Threads,Serial" g++ $GCC11_WARNING_FLAGS" ) elif [ "$SPOT_CHECK_TPLS" = "True" ]; then # Format: (compiler module-list build-list exe-name warning-flag) - COMPILERS=("intel/19.5.281 $BASE_MODULE_LIST_INTEL "OpenMP,Threads" icpc $INTEL_WARNING_FLAGS" - "gcc/10.2.0 $GCC102_MODULE_TPL_LIST "OpenMP_Serial" g++ $GCC_WARNING_FLAGS" + # Known issues: + # gcc/12.2.0+openblas/0.3.23 with OpenMP: internal compiler error: in get_vectype_for_scalar_type, at tree-vect-stmts + COMPILERS=("intel/2023.1.0 $BASE_MODULE_LIST_ONEAPI_202310 "OpenMP,Threads,Serial" icpx $ONEAPI_FLAGS_EXTRA" + "intel/2023.2.0 $BASE_MODULE_LIST_ONEAPI_202320 "OpenMP,Threads,Serial" icpx $ONEAPI_FLAGS_EXTRA" + "llvm/15.0.7 $BASE_MODULE_LIST_TPLS "Threads,Serial" clang++ $LLVM_EXTRA_FLAGS" + "gcc/11.3.0 $BASE_MODULE_LIST_TPLS "OpenMP,Threads,Serial" g++ $GCC11_WARNING_FLAGS" + "gcc/12.2.0 $BASE_MODULE_LIST_TPLS "OpenMP,Threads,Serial" g++ $GCC11_WARNING_FLAGS" ) else - COMPILERS=("intel/19.5.281 $BASE_MODULE_LIST_INTEL $INTEL_BUILD_LIST icpc $INTEL_WARNING_FLAGS" - "intel/2021.2.0 $BASE_MODULE_LIST_ONEAPI $INTEL_BUILD_LIST icpx $ONEAPI_WARNING_FLAGS" - "intel/2021.4.0 $BASE_MODULE_LIST_ONEAPI $INTEL_BUILD_LIST icpx $ONEAPI_WARNING_FLAGS" - "intel/2022.1.2 $BASE_MODULE_LIST_ONEAPI $INTEL_BUILD_LIST icpx $ONEAPI_WARNING_FLAGS" - "gcc/8.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/8.3.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/9.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/10.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "gcc/11.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC_WARNING_FLAGS" - "clang/10.0.1 $BASE_MODULE_LIST $CLANG_BUILD_LIST 
clang++ $CLANG_WARNING_FLAGS" + # gcc/12.2.0 with OpenMP: internal compiler error: in get_vectype_for_scalar_type, at tree-vect-stmts + COMPILERS=("intel/2023.1.0 $BASE_MODULE_LIST_ONEAPI_202310 $INTEL_BUILD_LIST icpx $ONEAPI_FLAGS_EXTRA" + "intel/2023.2.0 $BASE_MODULE_LIST_ONEAPI_202320 $INTEL_BUILD_LIST icpx $ONEAPI_FLAGS_EXTRA" + "llvm/15.0.7 $BASE_MODULE_LIST $CLANG_BUILD_LIST clang++ $LLVM_EXTRA_FLAGS" + "gcc/11.3.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC11_WARNING_FLAGS" + "gcc/12.2.0 $BASE_MODULE_LIST $GCC_BUILD_LIST g++ $GCC11_WARNING_FLAGS" ) fi if [ -z "$ARCH_FLAG" ]; then - ARCH_FLAG="--arch=SKX" + ARCH_FLAG="--arch=SPR" fi - SPACK_HOST_ARCH="+skx" + SPACK_HOST_ARCH="+spr" elif [ "$MACHINE" = "solo" ]; then SKIP_HWLOC=True export SLURM_TASKS_PER_NODE=32 @@ -792,8 +753,7 @@ elif [ "$MACHINE" = "solo" ]; then GNU102_MODULE_TPL_LIST="$BASE_MODULE_LIST,openblas/0.3.21" if [ "$SPOT_CHECK" = "True" ]; then - COMPILERS=( - "gnu/10.2.1 $BASE_MODULE_LIST "Threads_Serial,OpenMP" g++ $GNU_WARNING_FLAGS" + COMPILERS=("gnu/10.2.1 $BASE_MODULE_LIST "Threads_Serial,OpenMP" g++ $GNU_WARNING_FLAGS" "llvm/10.0.1 $BASE_MODULE_LIST_LLVM "Threads_Serial" clang++ $CLANG_WARNING_FLAGS" ) elif [ "$SPOT_CHECK_TPLS" = "True" ]; then @@ -802,8 +762,7 @@ elif [ "$MACHINE" = "solo" ]; then ) else ###"clang/10.0.1 $BASE_MODULE_LIST $CLANG_BUILD_LIST clang++ $CLANG_WARNING_FLAGS" - COMPILERS=( - "gnu/10.2.1 $BASE_MODULE_LIST $GNU_BUILD_LIST g++ $GNU_WARNING_FLAGS" + COMPILERS=("gnu/10.2.1 $BASE_MODULE_LIST $GNU_BUILD_LIST g++ $GNU_WARNING_FLAGS" ) fi @@ -884,6 +843,7 @@ fi export OMP_NUM_THREADS=${omp_num_threads:=8} export OMP_PROC_BIND=${omp_proc_bind:=spread} export OMP_PLACES=${omp_places:=cores} +export KOKKOS_NUM_THREADS=8 declare -i NUM_RESULTS_TO_KEEP=7 @@ -947,6 +907,7 @@ if [ "$COMPILERS_TO_TEST" == "" ]; then exit 1 fi + # # Functions. 
# @@ -1082,11 +1043,11 @@ setup_env() { if [[ "${SPOT_CHECK_TPLS}" = "True" ]]; then # device tpls if [[ "$compiler" == cuda* ]]; then - NEW_TPL_LIST="cublas,cusparse," + NEW_TPL_LIST="cublas,cusparse,cusolver," export KOKKOS_CUDA_OPTIONS="${KOKKOS_CUDA_OPTIONS},enable_lambda" fi if [[ "$compiler" == rocm* ]]; then - NEW_TPL_LIST="rocblas,rocsparse," + NEW_TPL_LIST="rocblas,rocsparse,rocsolver," fi # host tpls - use mkl with intel, else use host blas if [[ "$compiler" == intel* ]]; then @@ -1120,10 +1081,9 @@ setup_env() { if [[ "${SPOT_CHECK_TPLS}" = "True" ]]; then # Some machines will require explicitly setting include dirs and libs - if ([[ "$MACHINE" = weaver* ]] || [[ "$MACHINE" = blake* ]] || [[ "$MACHINE" = sogpu* ]]) && [[ "$mod" = openblas* ]]; then + if ([[ "$MACHINE" = weaver* ]] || [[ "$MACHINE" = sogpu* ]]) && [[ "$mod" = openblas* ]]; then BLAS_LIBRARY_DIRS="${OPENBLAS_ROOT}/lib" LAPACK_LIBRARY_DIRS="${OPENBLAS_ROOT}/lib" - # BLAS_LIBRARIES="openblas" BLAS_LIBRARIES="blas" LAPACK_LIBRARIES="lapack" KOKKOSKERNELS_TPL_PATH_CMD="--user-blas-path=${BLAS_LIBRARY_DIRS} --user-lapack-path=${LAPACK_LIBRARY_DIRS}" @@ -1131,6 +1091,16 @@ setup_env() { KOKKOSKERNELS_EXTRA_LINKER_FLAGS_CMD="--extra-linker-flags=-lgfortran,-lm" echo "TPL PATHS: KOKKOSKERNELS_TPL_PATH_CMD=$KOKKOSKERNELS_TPL_PATH_CMD" echo "TPL LIBS: KOKKOSKERNELS_TPL_LIBS_CMD=$KOKKOSKERNELS_TPL_LIBS_CMD" + elif [[ "$MACHINE" = blake* ]] && [[ "$mod" = openblas* ]]; then + BLAS_LIBRARY_DIRS="${OPENBLAS_ROOT}/lib" + LAPACK_LIBRARY_DIRS="${OPENBLAS_ROOT}/lib" + BLAS_LIBRARIES="openblas" + LAPACK_LIBRARIES="openblas" + KOKKOSKERNELS_TPL_PATH_CMD="--user-blas-path=${BLAS_LIBRARY_DIRS} --user-lapack-path=${LAPACK_LIBRARY_DIRS}" + KOKKOSKERNELS_TPL_LIBS_CMD="--user-blas-lib=${BLAS_LIBRARIES} --user-lapack-lib=${LAPACK_LIBRARIES}" + KOKKOSKERNELS_EXTRA_LINKER_FLAGS_CMD="--extra-linker-flags=-lgfortran,-lm" + echo "TPL PATHS: KOKKOSKERNELS_TPL_PATH_CMD=$KOKKOSKERNELS_TPL_PATH_CMD" + echo "TPL LIBS: 
KOKKOSKERNELS_TPL_LIBS_CMD=$KOKKOSKERNELS_TPL_LIBS_CMD" elif ([[ "$MACHINE" = weaver* ]]) && [[ "$mod" = netlib* ]]; then BLAS_LIBRARY_DIRS="${BLAS_ROOT}/lib" LAPACK_LIBRARY_DIRS="${BLAS_ROOT}/lib" @@ -1161,8 +1131,9 @@ single_build_and_test() { # Set up env. local compiler_modules_list=$(get_compiler_modules $compiler) - mkdir -p $ROOT_DIR/$compiler/"${build}-$build_type" - cd $ROOT_DIR/$compiler/"${build}-$build_type" + local BUILD_AND_TEST_DIR=$ROOT_DIR/$compiler/"${build}-$build_type" + mkdir -p $BUILD_AND_TEST_DIR + cd $BUILD_AND_TEST_DIR local kokkos_variants=$(get_kokkos_variants $compiler) local kernels_variants=$(get_kernels_variants $compiler) @@ -1205,6 +1176,7 @@ single_build_and_test() { echo " export OMP_NUM_THREADS=$omp_num_threads" &>> reload_modules.sh echo " export OMP_PROC_BIND=$omp_proc_bind" &>> reload_modules.sh echo " export OMP_PLACES=$omp_places" &>> reload_modules.sh + echo " export KOKKOS_NUM_THREADS=8" &>> reload_modules.sh echo "" &>> reload_modules.sh chmod +x reload_modules.sh @@ -1284,6 +1256,7 @@ single_build_and_test() { HIP_ENABLE_CMD="--with-hip" fi local arch_code=$(echo $ARCH_FLAG | cut -d "=" -f 2) + local tpl_list_print=$(echo $KOKKOSKERNELS_ENABLE_TPL_CMD | cut -d "=" -f2-) echo "kokkos devices: ${LOCAL_KOKKOS_DEVICES}" echo "kokkos arch: ${arch_code}" echo "kokkos options: ${KOKKOS_OPTIONS}" @@ -1294,16 +1267,17 @@ single_build_and_test() { echo "kokkoskernels ordinals: ${KOKKOSKERNELS_ORDINALS}" echo "kokkoskernels offsets: ${KOKKOSKERNELS_OFFSETS}" echo "kokkoskernels layouts: ${KOKKOSKERNELS_LAYOUTS}" + echo "kokkoskernels tpls list: ${tpl_list_print}" # KOKKOS_OPTIONS and KOKKOS_CUDA_OPTIONS are exported and detected by kokkos' generate_makefile.sh during install of kokkos; we pass them to the reproducer script instructions echo " # Use generate_makefile line below to call cmake which generates makefile for this build:" &> call_generate_makefile.sh - echo " ${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash 
--with-devices=$LOCAL_KOKKOS_DEVICES $ARCH_FLAG --compiler=$(which $compiler_exe) --cxxflags=\"$cxxflags\" --cxxstandard=\"$cxx_standard\" --ldflags=\"$ldflags\" $CUDA_ENABLE_CMD $HIP_ENABLE_CMD --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=$kk_scalars --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --with-layouts=${KOKKOSKERNELS_LAYOUTS} ${KOKKOSKERNELS_ENABLE_TPL_CMD} ${KOKKOSKERNELS_TPL_PATH_CMD} ${KOKKOSKERNELS_TPL_LIBS_CMD} ${KOKKOSKERNELS_EXTRA_LINKER_FLAGS_CMD} --with-options=${KOKKOS_OPTIONS} --with-cuda-options=${KOKKOS_CUDA_OPTIONS} ${KOKKOS_BOUNDS_CHECK} ${KOKKOSKERNELS_SPACES} --no-examples ${KOKKOS_DEPRECATED_CODE} $extra_args" &>> call_generate_makefile.sh + echo " ${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-devices=$LOCAL_KOKKOS_DEVICES $ARCH_FLAG --compiler=$(which $compiler_exe) --cxxflags=\"$cxxflags\" --cxxstandard=\"$cxx_standard\" --ldflags=\"$ldflags\" $CUDA_ENABLE_CMD $HIP_ENABLE_CMD --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=$kk_scalars --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --with-layouts=${KOKKOSKERNELS_LAYOUTS} ${KOKKOSKERNELS_ENABLE_TPL_CMD} ${KOKKOSKERNELS_TPL_PATH_CMD} ${KOKKOSKERNELS_TPL_LIBS_CMD} ${KOKKOSKERNELS_EXTRA_LINKER_FLAGS_CMD} --with-options=${KOKKOS_OPTIONS} --with-cuda-options=${KOKKOS_CUDA_OPTIONS} ${KOKKOS_BOUNDS_CHECK} ${KOKKOSKERNELS_SPACES} --no-examples ${KOKKOS_DEPRECATED_CODE} --cmake-flags=${PASSTHRU_CMAKE_FLAGS} --kokkos-cmake-flags=${KOKKOS_PASSTHRU_CMAKE_FLAGS} $extra_args" &>> call_generate_makefile.sh chmod +x call_generate_makefile.sh # script command with generic path for faster copy/paste of reproducer into issues - echo " # \$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=$LOCAL_KOKKOS_DEVICES $ARCH_FLAG --compiler=$(which $compiler_exe) --cxxflags=\"$cxxflags\" --cxxstandard=\"$cxx_standard\" --ldflags=\"$ldflags\" 
$CUDA_ENABLE_CMD $HIP_ENABLE_CMD --kokkos-path=\$KOKKOS_PATH --kokkoskernels-path=\$KOKKOSKERNELS_PATH --with-scalars=$kk_scalars --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --with-layouts=${KOKKOSKERNELS_LAYOUTS} ${KOKKOSKERNELS_ENABLE_TPL_CMD} ${KOKKOSKERNELS_TPL_PATH_CMD} ${KOKKOSKERNELS_TPL_LIBS_CMD} ${KOKKOSKERNELS_EXTRA_LINKER_FLAGS_CMD} --with-options=${KOKKOS_OPTIONS} --with-cuda-options=${KOKKOS_CUDA_OPTIONS} ${KOKKOS_BOUNDS_CHECK} ${KOKKOSKERNELS_SPACES} --no-examples ${KOKKOS_DEPRECATED_CODE} $extra_args" &> call_generate_makefile_genericpath.sh + echo " # \$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=$LOCAL_KOKKOS_DEVICES $ARCH_FLAG --compiler=$(which $compiler_exe) --cxxflags=\"$cxxflags\" --cxxstandard=\"$cxx_standard\" --ldflags=\"$ldflags\" $CUDA_ENABLE_CMD $HIP_ENABLE_CMD --kokkos-path=\$KOKKOS_PATH --kokkoskernels-path=\$KOKKOSKERNELS_PATH --with-scalars=$kk_scalars --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --with-layouts=${KOKKOSKERNELS_LAYOUTS} ${KOKKOSKERNELS_ENABLE_TPL_CMD} ${KOKKOSKERNELS_TPL_PATH_CMD} ${KOKKOSKERNELS_TPL_LIBS_CMD} ${KOKKOSKERNELS_EXTRA_LINKER_FLAGS_CMD} --with-options=${KOKKOS_OPTIONS} --with-cuda-options=${KOKKOS_CUDA_OPTIONS} ${KOKKOS_BOUNDS_CHECK} ${KOKKOSKERNELS_SPACES} --no-examples ${KOKKOS_DEPRECATED_CODE} --cmake-flags=${PASSTHRU_CMAKE_FLAGS} --kokkos-cmake-flags=${KOKKOS_PASSTHRU_CMAKE_FLAGS} $extra_args" &> call_generate_makefile_genericpath.sh - run_cmd ${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-devices=$LOCAL_KOKKOS_DEVICES $ARCH_FLAG --compiler=$(which $compiler_exe) --cxxflags=\"$cxxflags\" --cxxstandard=\"$cxx_standard\" --ldflags=\"$ldflags\" $CUDA_ENABLE_CMD $HIP_ENABLE_CMD --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=$kk_scalars --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --with-layouts=${KOKKOSKERNELS_LAYOUTS} 
${KOKKOSKERNELS_ENABLE_TPL_CMD} ${KOKKOSKERNELS_TPL_PATH_CMD} ${KOKKOSKERNELS_TPL_LIBS_CMD} ${KOKKOSKERNELS_EXTRA_LINKER_FLAGS_CMD} ${KOKKOS_BOUNDS_CHECK} ${KOKKOSKERNELS_SPACES} --no-examples ${KOKKOS_DEPRECATED_CODE} $extra_args &>> ${desc}.configure.log || { report_and_log_test_result 1 ${desc} configure && return 0; } + run_cmd ${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-devices=$LOCAL_KOKKOS_DEVICES $ARCH_FLAG --compiler=$(which $compiler_exe) --cxxflags=\"$cxxflags\" --cxxstandard=\"$cxx_standard\" --ldflags=\"$ldflags\" $CUDA_ENABLE_CMD $HIP_ENABLE_CMD --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=$kk_scalars --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --with-layouts=${KOKKOSKERNELS_LAYOUTS} ${KOKKOSKERNELS_ENABLE_TPL_CMD} ${KOKKOSKERNELS_TPL_PATH_CMD} ${KOKKOSKERNELS_TPL_LIBS_CMD} ${KOKKOSKERNELS_EXTRA_LINKER_FLAGS_CMD} ${KOKKOS_BOUNDS_CHECK} ${KOKKOSKERNELS_SPACES} --no-examples ${KOKKOS_DEPRECATED_CODE} --cmake-flags=${PASSTHRU_CMAKE_FLAGS} --kokkos-cmake-flags=${KOKKOS_PASSTHRU_CMAKE_FLAGS} $extra_args &>> ${desc}.configure.log || { report_and_log_test_result 1 ${desc} configure && return 0; } local -i build_start_time=$(date +%s) run_cmd make -j $MAKE_PAR_LEVEL all >& ${desc}.build.log || { report_and_log_test_result 1 ${desc} build && return 0; } @@ -1421,9 +1395,15 @@ wait_summarize_and_exit() { rv=$rv+1 local str=$failed_test - local comp=$(echo "$str" | cut -d- -f1) - local vers=$(echo "$str" | cut -d- -f2) - local lbuild=$(echo "$str" | cut -d- -f3-) + # Note: all relevant info in str to assemble the build directory path + # is separated by dashes; however the compiler name may include dashes as well + # the final two pieces of str always the version and build-type (as set in BUILD_AND_TEST_DIR) + # leaving the compiler name as the remaining fields preceding version + local getdashes="${str//[^-]}" + local numdashes=${#getdashes} + local lbuild=$(echo "$str" | 
cut -d- -f${numdashes}-) + local vers=$(echo "$str" | cut -d- -f$((numdashes-1))) + local comp=$(echo "$str" | cut -d- -f-$((numdashes-2))) # Generate reproducer instructions #local filename=reproducer_instructions-$comp-$vers-$lbuild local faildir=$ROOT_DIR/$comp/$vers/$lbuild diff --git a/sparse/eti/generated_specializations_cpp/spmv/KokkosSparse_spmv_bsrmatrix_eti_spec_inst.cpp.in b/sparse/eti/generated_specializations_cpp/spmv/KokkosSparse_spmv_bsrmatrix_eti_spec_inst.cpp.in index 9895083764..077150f36c 100644 --- a/sparse/eti/generated_specializations_cpp/spmv/KokkosSparse_spmv_bsrmatrix_eti_spec_inst.cpp.in +++ b/sparse/eti/generated_specializations_cpp/spmv/KokkosSparse_spmv_bsrmatrix_eti_spec_inst.cpp.in @@ -19,11 +19,9 @@ #include "KokkosSparse_spmv_bsrmatrix_spec.hpp" namespace KokkosSparse { -namespace Experimental { namespace Impl { // clang-format off @SPARSE_SPMV_BSRMATRIX_ETI_INST_BLOCK@ // clang-format on } // namespace Impl -} // namespace Experimental -} // namespace KokkosSparse \ No newline at end of file +} // namespace KokkosSparse diff --git a/sparse/eti/generated_specializations_cpp/spmv/KokkosSparse_spmv_mv_bsrmatrix_eti_spec_inst.cpp.in b/sparse/eti/generated_specializations_cpp/spmv/KokkosSparse_spmv_mv_bsrmatrix_eti_spec_inst.cpp.in index d089eca0e3..2c9a6083bf 100644 --- a/sparse/eti/generated_specializations_cpp/spmv/KokkosSparse_spmv_mv_bsrmatrix_eti_spec_inst.cpp.in +++ b/sparse/eti/generated_specializations_cpp/spmv/KokkosSparse_spmv_mv_bsrmatrix_eti_spec_inst.cpp.in @@ -19,11 +19,9 @@ #include "KokkosSparse_spmv_bsrmatrix_spec.hpp" namespace KokkosSparse { -namespace Experimental { namespace Impl { // clang-format off @SPARSE_SPMV_MV_BSRMATRIX_ETI_INST_BLOCK@ /// // clang-format on } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse diff --git a/sparse/eti/generated_specializations_hpp/KokkosSparse_spmv_bsrmatrix_eti_spec_avail.hpp.in 
b/sparse/eti/generated_specializations_hpp/KokkosSparse_spmv_bsrmatrix_eti_spec_avail.hpp.in index f98e60ae0d..278b60a813 100644 --- a/sparse/eti/generated_specializations_hpp/KokkosSparse_spmv_bsrmatrix_eti_spec_avail.hpp.in +++ b/sparse/eti/generated_specializations_hpp/KokkosSparse_spmv_bsrmatrix_eti_spec_avail.hpp.in @@ -17,12 +17,10 @@ #ifndef KOKKOSSPARSE_SPMV_BSRMATRIX_ETI_SPEC_AVAIL_HPP_ #define KOKKOSSPARSE_SPMV_BSRMATRIX_ETI_SPEC_AVAIL_HPP_ namespace KokkosSparse { -namespace Experimental { namespace Impl { // clang-format off @SPARSE_SPMV_BSRMATRIX_ETI_AVAIL_BLOCK@ // clang-format on } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse #endif diff --git a/sparse/eti/generated_specializations_hpp/KokkosSparse_spmv_mv_bsrmatrix_eti_spec_avail.hpp.in b/sparse/eti/generated_specializations_hpp/KokkosSparse_spmv_mv_bsrmatrix_eti_spec_avail.hpp.in index df53928266..3247985f4c 100644 --- a/sparse/eti/generated_specializations_hpp/KokkosSparse_spmv_mv_bsrmatrix_eti_spec_avail.hpp.in +++ b/sparse/eti/generated_specializations_hpp/KokkosSparse_spmv_mv_bsrmatrix_eti_spec_avail.hpp.in @@ -18,12 +18,10 @@ #define KOKKOSSPARSE_SPMV_MV_BSRMATRIX_ETI_SPEC_AVAIL_HPP_ namespace KokkosSparse { -namespace Experimental { namespace Impl { // clang-format off @SPARSE_SPMV_MV_BSRMATRIX_ETI_AVAIL_BLOCK@ // clang-format on } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse #endif diff --git a/sparse/impl/KokkosSparse_coo2crs_impl.hpp b/sparse/impl/KokkosSparse_coo2crs_impl.hpp index d00a6f34a9..aaa5cdcb72 100644 --- a/sparse/impl/KokkosSparse_coo2crs_impl.hpp +++ b/sparse/impl/KokkosSparse_coo2crs_impl.hpp @@ -15,11 +15,6 @@ //@HEADER #ifndef KOKKOSSPARSE_COO2CRS_IMPL_HPP #define KOKKOSSPARSE_COO2CRS_IMPL_HPP -// The unorderedmap changes necessary for this to work -// have not made it into Kokkos 4.0.00 pr 4.0.01 will -// need to see if it happens in 4.1.00 to have a final -// version check here. 
-#if KOKKOS_VERSION >= 40099 #include #include "Kokkos_UnorderedMap.hpp" @@ -196,8 +191,13 @@ class Coo2Crs { reinterpret_cast(Kokkos::kokkos_malloc( "m_umaps", m_nrows * sizeof(UmapType))); - using shallow_copy_to_device = - Kokkos::Impl::DeepCopy; + auto shallow_copy_to_device = [](UmapType *dst, UmapType const *src, + std::size_t cnt) { + std::size_t nn = cnt / sizeof(UmapType); + Kokkos::deep_copy( + Kokkos::View(dst, nn), + Kokkos::View(src, nn)); + }; UmapType **umap_ptrs = new UmapType *[m_nrows]; // TODO: use host-level parallel_for with tag rowmapRp1 @@ -275,6 +275,4 @@ class Coo2Crs { } // namespace Impl } // namespace KokkosSparse -#endif // KOKKOS_VERSION >= 40099 - #endif // KOKKOSSPARSE_COO2CRS_IMPL_HPP diff --git a/sparse/impl/KokkosSparse_gauss_seidel_impl.hpp b/sparse/impl/KokkosSparse_gauss_seidel_impl.hpp index 7391e00e3d..f0b78408bc 100644 --- a/sparse/impl/KokkosSparse_gauss_seidel_impl.hpp +++ b/sparse/impl/KokkosSparse_gauss_seidel_impl.hpp @@ -1547,8 +1547,8 @@ class PointGaussSeidel { Permuted_Yvector); } if (init_zero_x_vector) { - KokkosKernels::Impl::zero_vector< - MyExecSpace, scalar_persistent_work_view2d_t, MyExecSpace>( + KokkosKernels::Impl::zero_vector( my_exec_space, num_cols * block_size, Permuted_Xvector); } else { KokkosKernels::Impl::permute_block_vector< @@ -1664,8 +1664,8 @@ class PointGaussSeidel { Permuted_Yvector); } if (init_zero_x_vector) { - KokkosKernels::Impl::zero_vector< - MyExecSpace, scalar_persistent_work_view2d_t, MyExecSpace>( + KokkosKernels::Impl::zero_vector( my_exec_space, num_cols, Permuted_Xvector); } else { KokkosKernels::Impl::permute_vector< diff --git a/sparse/impl/KokkosSparse_gmres_impl.hpp b/sparse/impl/KokkosSparse_gmres_impl.hpp index 8c7231f90c..f616bfe8f3 100644 --- a/sparse/impl/KokkosSparse_gmres_impl.hpp +++ b/sparse/impl/KokkosSparse_gmres_impl.hpp @@ -70,7 +70,7 @@ struct GmresWrap { Kokkos::Profiling::pushRegion("GMRES::TotalTime:"); // Store solver options: - const auto n = A.numRows(); 
+ const auto n = A.numPointRows(); const int m = thandle.get_m(); const auto maxRestart = thandle.get_max_restart(); const auto tol = thandle.get_tol(); diff --git a/sparse/impl/KokkosSparse_gmres_spec.hpp b/sparse/impl/KokkosSparse_gmres_spec.hpp index bfe1c4539a..a588793ff8 100644 --- a/sparse/impl/KokkosSparse_gmres_spec.hpp +++ b/sparse/impl/KokkosSparse_gmres_spec.hpp @@ -23,6 +23,7 @@ #include #include #include "KokkosSparse_CrsMatrix.hpp" +#include "KokkosSparse_BsrMatrix.hpp" #include "KokkosKernels_Handle.hpp" // Include the actual functors @@ -81,10 +82,15 @@ template ::value> struct GMRES { - using AMatrix = CrsMatrix; + using AMatrix = CrsMatrix; + using BAMatrix = KokkosSparse::Experimental::BsrMatrix; static void gmres( KernelHandle *handle, const AMatrix &A, const BType &B, XType &X, KokkosSparse::Experimental::Preconditioner *precond = nullptr); + + static void gmres( + KernelHandle *handle, const BAMatrix &A, const BType &B, XType &X, + KokkosSparse::Experimental::Preconditioner *precond = nullptr); }; #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY @@ -104,6 +110,17 @@ struct GMRES; + static void gmres( + KernelHandle *handle, const BAMatrix &A, const BType &B, XType &X, + KokkosSparse::Experimental::Preconditioner *precond = nullptr) { + auto gmres_handle = handle->get_gmres_handle(); + using Gmres = Experimental::GmresWrap< + typename std::remove_pointer::type>; + + Gmres::gmres(*gmres_handle, A, B, X, precond); + } }; #endif diff --git a/sparse/impl/KokkosSparse_par_ilut_numeric_impl.hpp b/sparse/impl/KokkosSparse_par_ilut_numeric_impl.hpp index 0ac9c26166..6bdf0eb577 100644 --- a/sparse/impl/KokkosSparse_par_ilut_numeric_impl.hpp +++ b/sparse/impl/KokkosSparse_par_ilut_numeric_impl.hpp @@ -588,7 +588,7 @@ struct IlutWrap { count); Kokkos::single(Kokkos::PerTeam(team), - [=]() { O_row_map(row_idx) = count; }); + [&]() { O_row_map(row_idx) = count; }); } float_t threshold; @@ -699,18 +699,24 @@ struct IlutWrap { 
multiply_matrices(kh, ih, L_row_map, L_entries, L_values, U_row_map, U_entries, U_values, LU_row_map, LU_entries, LU_values); - auto addHandle = kh.get_spadd_handle(); - KokkosSparse::Experimental::spadd_symbolic( - &kh, A_row_map, A_entries, LU_row_map, LU_entries, R_row_map); + auto addHandle = kh.get_spadd_handle(); + typename KHandle::const_nnz_lno_t m = A_row_map.extent(0) - 1, + n = m; // square matrix + // TODO: let compute_residual_norm also take an execution space argument and + // use that for exec! + typename KHandle::HandleExecSpace exec{}; + KokkosSparse::Experimental::spadd_symbolic(exec, &kh, m, n, A_row_map, + A_entries, LU_row_map, + LU_entries, R_row_map); const size_type r_nnz = addHandle->get_c_nnz(); - Kokkos::resize(R_entries, r_nnz); - Kokkos::resize(R_values, r_nnz); + Kokkos::resize(exec, R_entries, r_nnz); + Kokkos::resize(exec, R_values, r_nnz); KokkosSparse::Experimental::spadd_numeric( - &kh, A_row_map, A_entries, A_values, 1., LU_row_map, LU_entries, - LU_values, -1., R_row_map, R_entries, R_values); - + exec, &kh, m, n, A_row_map, A_entries, A_values, 1., LU_row_map, + LU_entries, LU_values, -1., R_row_map, R_entries, R_values); + // TODO: how to make this policy use exec? 
auto policy = ih.get_default_team_policy(); Kokkos::parallel_reduce( diff --git a/sparse/impl/KokkosSparse_spadd_numeric_impl.hpp b/sparse/impl/KokkosSparse_spadd_numeric_impl.hpp index 8e70cd3c3b..fa356dc963 100644 --- a/sparse/impl/KokkosSparse_spadd_numeric_impl.hpp +++ b/sparse/impl/KokkosSparse_spadd_numeric_impl.hpp @@ -174,24 +174,23 @@ struct UnsortedNumericSumFunctor { std::is_same::type, \ typename std::remove_const::type>::value -template +template < + typename execution_space, typename KernelHandle, typename alno_row_view_t, + typename alno_nnz_view_t, typename ascalar_t, typename ascalar_nnz_view_t, + typename blno_row_view_t, typename blno_nnz_view_t, typename bscalar_t, + typename bscalar_nnz_view_t, typename clno_row_view_t, + typename clno_nnz_view_t, typename cscalar_nnz_view_t> void spadd_numeric_impl( - KernelHandle* kernel_handle, const alno_row_view_t a_rowmap, - const alno_nnz_view_t a_entries, const ascalar_nnz_view_t a_values, - const ascalar_t alpha, const blno_row_view_t b_rowmap, - const blno_nnz_view_t b_entries, const bscalar_nnz_view_t b_values, - const bscalar_t beta, const clno_row_view_t c_rowmap, - clno_nnz_view_t c_entries, cscalar_nnz_view_t c_values) { + const execution_space& exec, KernelHandle* kernel_handle, + const alno_row_view_t a_rowmap, const alno_nnz_view_t a_entries, + const ascalar_nnz_view_t a_values, const ascalar_t alpha, + const blno_row_view_t b_rowmap, const blno_nnz_view_t b_entries, + const bscalar_nnz_view_t b_values, const bscalar_t beta, + const clno_row_view_t c_rowmap, clno_nnz_view_t c_entries, + cscalar_nnz_view_t c_values) { typedef typename KernelHandle::size_type size_type; typedef typename KernelHandle::nnz_lno_t ordinal_type; typedef typename KernelHandle::nnz_scalar_t scalar_type; - typedef - typename KernelHandle::SPADDHandleType::execution_space execution_space; // Check that A/B/C data types match KernelHandle types, and that C data types // are nonconst (doesn't matter if A/B types are const) 
static_assert(SAME_TYPE(ascalar_t, scalar_type), @@ -252,7 +251,7 @@ void spadd_numeric_impl( sortedNumeric(a_rowmap, b_rowmap, c_rowmap, a_entries, b_entries, c_entries, a_values, b_values, c_values, alpha, beta); Kokkos::parallel_for("KokkosSparse::SpAdd:Numeric::InputSorted", - range_type(0, nrows), sortedNumeric); + range_type(exec, 0, nrows), sortedNumeric); } else { // use a_pos and b_pos (set in the handle by symbolic) to quickly compute C // entries and values @@ -265,7 +264,7 @@ void spadd_numeric_impl( c_entries, a_values, b_values, c_values, alpha, beta, addHandle->get_a_pos(), addHandle->get_b_pos()); Kokkos::parallel_for("KokkosSparse::SpAdd:Numeric::InputNotSorted", - range_type(0, nrows), unsortedNumeric); + range_type(exec, 0, nrows), unsortedNumeric); } addHandle->set_call_numeric(); } diff --git a/sparse/impl/KokkosSparse_spadd_numeric_spec.hpp b/sparse/impl/KokkosSparse_spadd_numeric_spec.hpp index e81649f552..18731348de 100644 --- a/sparse/impl/KokkosSparse_spadd_numeric_spec.hpp +++ b/sparse/impl/KokkosSparse_spadd_numeric_spec.hpp @@ -28,10 +28,10 @@ namespace KokkosSparse { namespace Impl { // Specialization struct which defines whether a specialization exists -template +template struct spadd_numeric_eti_spec_avail { enum : bool { value = false }; }; @@ -44,6 +44,7 @@ struct spadd_numeric_eti_spec_avail { MEM_SPACE_TYPE) \ template <> \ struct spadd_numeric_eti_spec_avail< \ + EXEC_SPACE_TYPE, \ KokkosKernels::Experimental::KokkosKernelsHandle< \ const OFFSET_TYPE, const ORDINAL_TYPE, const SCALAR_TYPE, \ EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>, \ @@ -87,20 +88,22 @@ namespace Impl { // Unification layer /// \brief Implementation of KokkosBlas::spadd (sparse-sparse matrix addition) -template ::value, + ExecSpace, KernelHandle, a_size_view_t, a_lno_view_t, + a_scalar_view_t, b_size_view_t, b_lno_view_t, b_scalar_view_t, + c_size_view_t, c_lno_view_t, c_scalar_view_t>::value, bool eti_spec_avail = spadd_numeric_eti_spec_avail< - 
KernelHandle, a_size_view_t, a_lno_view_t, a_scalar_view_t, - b_size_view_t, b_lno_view_t, b_scalar_view_t, c_size_view_t, - c_lno_view_t, c_scalar_view_t>::value> + ExecSpace, KernelHandle, a_size_view_t, a_lno_view_t, + a_scalar_view_t, b_size_view_t, b_lno_view_t, b_scalar_view_t, + c_size_view_t, c_lno_view_t, c_scalar_view_t>::value> struct SPADD_NUMERIC { - static void spadd_numeric(KernelHandle *handle, + static void spadd_numeric(const ExecSpace &exec, KernelHandle *handle, + typename KernelHandle::const_nnz_lno_t m, + typename KernelHandle::const_nnz_lno_t n, typename a_scalar_view_t::const_value_type alpha, a_size_view_t row_mapA, a_lno_view_t entriesA, a_scalar_view_t valuesA, @@ -112,15 +115,17 @@ struct SPADD_NUMERIC { #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY -template -struct SPADD_NUMERIC { - static void spadd_numeric(KernelHandle *handle, +template +struct SPADD_NUMERIC< + ExecSpace, KernelHandle, a_size_view_t, a_lno_view_t, a_scalar_view_t, + b_size_view_t, b_lno_view_t, b_scalar_view_t, c_size_view_t, c_lno_view_t, + c_scalar_view_t, false, KOKKOSKERNELS_IMPL_COMPILE_LIBRARY> { + static void spadd_numeric(const ExecSpace &exec, KernelHandle *handle, + typename KernelHandle::const_nnz_lno_t /* m */, + typename KernelHandle::const_nnz_lno_t /* n */, typename a_scalar_view_t::const_value_type alpha, a_size_view_t row_mapA, a_lno_view_t entriesA, a_scalar_view_t valuesA, @@ -128,8 +133,9 @@ struct SPADD_NUMERIC, \ @@ -178,6 +185,7 @@ struct SPADD_NUMERIC, \ @@ -210,6 +218,6 @@ struct SPADD_NUMERIC >, \ false, true>; -#include +#include #endif diff --git a/sparse/impl/KokkosSparse_spadd_symbolic_impl.hpp b/sparse/impl/KokkosSparse_spadd_symbolic_impl.hpp index 15132f9da3..80506e3056 100644 --- a/sparse/impl/KokkosSparse_spadd_symbolic_impl.hpp +++ b/sparse/impl/KokkosSparse_spadd_symbolic_impl.hpp @@ -371,50 +371,48 @@ struct MergeEntriesFunctor { }; // Run SortedCountEntries: non-GPU, always uses the RangePolicy 
version. -template +template void runSortedCountEntries( - const alno_row_view_t_& a_rowmap, const alno_nnz_view_t_& a_entries, - const blno_row_view_t_& b_rowmap, const blno_nnz_view_t_& b_entries, - const clno_row_view_t_& c_rowmap, - typename std::enable_if()>::type* = + const execution_space& exec, const alno_row_view_t_& a_rowmap, + const alno_nnz_view_t_& a_entries, const blno_row_view_t_& b_rowmap, + const blno_nnz_view_t_& b_entries, const clno_row_view_t_& c_rowmap, + typename std::enable_if< + !KokkosKernels::Impl::kk_is_gpu_exec_space()>::type* = nullptr) { using size_type = typename KernelHandle::size_type; using ordinal_type = typename KernelHandle::nnz_lno_t; - using execution_space = - typename KernelHandle::SPADDHandleType::execution_space; - using range_type = Kokkos::RangePolicy; - auto nrows = c_rowmap.extent(0) - 1; + using range_type = Kokkos::RangePolicy; + auto nrows = c_rowmap.extent(0) - 1; SortedCountEntriesRange countEntries(nrows, a_rowmap, a_entries, b_rowmap, b_entries, c_rowmap); Kokkos::parallel_for( "KokkosSparse::SpAdd::Symbolic::InputSorted::CountEntries", - range_type(0, nrows), countEntries); + range_type(exec, 0, nrows), countEntries); } // Run SortedCountEntries: GPU, uses the TeamPolicy or RangePolicy depending // on average nz per row (a runtime decision) -template +template void runSortedCountEntries( - const alno_row_view_t_& a_rowmap, const alno_nnz_view_t_& a_entries, - const blno_row_view_t_& b_rowmap, const blno_nnz_view_t_& b_entries, - const clno_row_view_t_& c_rowmap, - typename std::enable_if()>::type* = + const execution_space& exec, const alno_row_view_t_& a_rowmap, + const alno_nnz_view_t_& a_entries, const blno_row_view_t_& b_rowmap, + const blno_nnz_view_t_& b_entries, const clno_row_view_t_& c_rowmap, + typename std::enable_if< + KokkosKernels::Impl::kk_is_gpu_exec_space()>::type* = nullptr) { using size_type = typename KernelHandle::size_type; using ordinal_type = typename KernelHandle::nnz_lno_t; - using 
execution_space = - typename KernelHandle::SPADDHandleType::execution_space; - using RangePol = Kokkos::RangePolicy; - using TeamPol = Kokkos::TeamPolicy; - auto nrows = c_rowmap.extent(0) - 1; + using RangePol = Kokkos::RangePolicy; + using TeamPol = Kokkos::TeamPolicy; + auto nrows = c_rowmap.extent(0) - 1; size_type c_est_nnz = 1.4 * (a_entries.extent(0) + b_entries.extent(0)) / nrows; if (c_est_nnz <= 512) { @@ -435,14 +433,14 @@ void runSortedCountEntries( countEntries(nrows, a_rowmap, a_entries, b_rowmap, b_entries, c_rowmap); countEntries.sharedPerThread = pot_est_nnz; // compute largest possible team size - TeamPol testPolicy(1, 1, vector_length); + TeamPol testPolicy(exec, 1, 1, vector_length); testPolicy.set_scratch_size( 0, Kokkos::PerThread(pot_est_nnz * sizeof(ordinal_type))); int team_size = testPolicy.team_size_recommended(countEntries, Kokkos::ParallelForTag()); // construct real policy int league_size = (nrows + team_size - 1) / team_size; - TeamPol policy(league_size, team_size, vector_length); + TeamPol policy(exec, league_size, team_size, vector_length); policy.set_scratch_size( 0, Kokkos::PerThread(pot_est_nnz * sizeof(ordinal_type))); countEntries.totalShared = @@ -457,24 +455,23 @@ void runSortedCountEntries( countEntries(nrows, a_rowmap, a_entries, b_rowmap, b_entries, c_rowmap); Kokkos::parallel_for( "KokkosSparse::SpAdd::Symbolic::InputSorted::CountEntries", - RangePol(0, nrows), countEntries); + RangePol(exec, 0, nrows), countEntries); } } // Symbolic: count entries in each row in C to produce rowmap // kernel handle has information about whether it is sorted add or not. 
-template
+template
 void spadd_symbolic_impl(
-    KernelHandle* handle, const alno_row_view_t_ a_rowmap,
-    const alno_nnz_view_t_ a_entries, const blno_row_view_t_ b_rowmap,
-    const blno_nnz_view_t_ b_entries,
+    const execution_space& exec, KernelHandle* handle,
+    const alno_row_view_t_ a_rowmap, const alno_nnz_view_t_ a_entries,
+    const blno_row_view_t_ b_rowmap, const blno_nnz_view_t_ b_entries,
     clno_row_view_t_ c_rowmap)  // c_rowmap must already be allocated (doesn't
                                 // need to be initialized)
 {
-  typedef
-      typename KernelHandle::SPADDHandleType::execution_space execution_space;
   typedef typename KernelHandle::size_type size_type;
   typedef typename KernelHandle::nnz_lno_t ordinal_type;
   typedef typename KernelHandle::SPADDHandleType::nnz_lno_view_t ordinal_view_t;
@@ -520,17 +517,18 @@ void spadd_symbolic_impl(
   ordinal_type nrows = a_rowmap.extent(0) - 1;
   typedef Kokkos::RangePolicy range_type;
   if (addHandle->is_input_sorted()) {
-    runSortedCountEntries(
-        a_rowmap, a_entries, b_rowmap, b_entries, c_rowmap);
+    runSortedCountEntries(exec, a_rowmap, a_entries, b_rowmap,
+                          b_entries, c_rowmap);
     KokkosKernels::Impl::kk_exclusive_parallel_prefix_sum(
-        nrows + 1, c_rowmap);
+        exec, nrows + 1, c_rowmap);
   } else {
     // note: scoping individual parts of the process to free views sooner,
     // minimizing peak memory usage run the unsorted c_rowmap upper bound
     // functor (just adds together A and B entry counts row by row)
     offset_view_t c_rowmap_upperbound(
-        Kokkos::view_alloc(Kokkos::WithoutInitializing,
+        Kokkos::view_alloc(exec, Kokkos::WithoutInitializing,
                            "C row counts upper bound"),
         nrows + 1);
     size_type c_nnz_upperbound = 0;
@@ -540,17 +538,17 @@ void spadd_symbolic_impl(
       countEntries(nrows, a_rowmap, b_rowmap, c_rowmap_upperbound);
       Kokkos::parallel_for(
           "KokkosSparse::SpAdd:Symbolic::InputNotSorted::CountEntries",
-          range_type(0, nrows), countEntries);
+          range_type(exec, 0, nrows), countEntries);
       KokkosKernels::Impl::kk_exclusive_parallel_prefix_sum(
-          nrows + 1,
c_rowmap_upperbound); - Kokkos::deep_copy(c_nnz_upperbound, + exec, nrows + 1, c_rowmap_upperbound); + Kokkos::deep_copy(exec, c_nnz_upperbound, Kokkos::subview(c_rowmap_upperbound, nrows)); } ordinal_view_t c_entries_uncompressed( - Kokkos::view_alloc(Kokkos::WithoutInitializing, + Kokkos::view_alloc(exec, Kokkos::WithoutInitializing, "C entries uncompressed"), c_nnz_upperbound); - ordinal_view_t ab_perm(Kokkos::view_alloc(Kokkos::WithoutInitializing, + ordinal_view_t ab_perm(Kokkos::view_alloc(exec, Kokkos::WithoutInitializing, "A and B permuted entry indices"), c_nnz_upperbound); // compute the unmerged sum @@ -561,17 +559,17 @@ void spadd_symbolic_impl( c_rowmap_upperbound, c_entries_uncompressed, ab_perm); Kokkos::parallel_for( "KokkosSparse::SpAdd:Symbolic::InputNotSorted::UnmergedSum", - range_type(0, nrows), unmergedSum); + range_type(exec, 0, nrows), unmergedSum); // sort the unmerged sum KokkosSparse::sort_crs_matrix( - c_rowmap_upperbound, c_entries_uncompressed, ab_perm); - ordinal_view_t a_pos( - Kokkos::view_alloc(Kokkos::WithoutInitializing, "A entry positions"), - a_entries.extent(0)); - ordinal_view_t b_pos( - Kokkos::view_alloc(Kokkos::WithoutInitializing, "B entry positions"), - b_entries.extent(0)); + exec, c_rowmap_upperbound, c_entries_uncompressed, ab_perm); + ordinal_view_t a_pos(Kokkos::view_alloc(exec, Kokkos::WithoutInitializing, + "A entry positions"), + a_entries.extent(0)); + ordinal_view_t b_pos(Kokkos::view_alloc(exec, Kokkos::WithoutInitializing, + "B entry positions"), + b_entries.extent(0)); // merge the entries and compute Apos/Bpos, as well as Crowcounts { MergeEntriesFunctor( - nrows + 1, c_rowmap); + exec, nrows + 1, c_rowmap); } addHandle->set_a_b_pos(a_pos, b_pos); } // provide the number of NNZ in C to user through handle size_type cmax; - Kokkos::deep_copy(cmax, Kokkos::subview(c_rowmap, nrows)); + Kokkos::deep_copy(exec, cmax, Kokkos::subview(c_rowmap, nrows)); addHandle->set_c_nnz(cmax); addHandle->set_call_symbolic(); 
addHandle->set_call_numeric(false); diff --git a/sparse/impl/KokkosSparse_spadd_symbolic_spec.hpp b/sparse/impl/KokkosSparse_spadd_symbolic_spec.hpp index aaab68568a..bdc4ed04bd 100644 --- a/sparse/impl/KokkosSparse_spadd_symbolic_spec.hpp +++ b/sparse/impl/KokkosSparse_spadd_symbolic_spec.hpp @@ -28,8 +28,9 @@ namespace KokkosSparse { namespace Impl { // Specialization struct which defines whether a specialization exists -template +template struct spadd_symbolic_eti_spec_avail { enum : bool { value = false }; }; @@ -42,6 +43,7 @@ struct spadd_symbolic_eti_spec_avail { MEM_SPACE_TYPE) \ template <> \ struct spadd_symbolic_eti_spec_avail< \ + EXEC_SPACE_TYPE, \ KokkosKernels::Experimental::KokkosKernelsHandle< \ const OFFSET_TYPE, const ORDINAL_TYPE, const SCALAR_TYPE, \ EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>, \ @@ -73,31 +75,39 @@ namespace Impl { // Unification layer /// \brief Implementation of KokkosBlas::spadd (sparse-sparse matrix addition) -template ::value, + ExecSpace, KernelHandle, a_size_view_t, a_lno_view_t, + b_size_view_t, b_lno_view_t, c_size_view_t>::value, bool eti_spec_avail = spadd_symbolic_eti_spec_avail< - KernelHandle, a_size_view_t, a_lno_view_t, b_size_view_t, - b_lno_view_t, c_size_view_t>::value> + ExecSpace, KernelHandle, a_size_view_t, a_lno_view_t, + b_size_view_t, b_lno_view_t, c_size_view_t>::value> struct SPADD_SYMBOLIC { - static void spadd_symbolic(KernelHandle *handle, a_size_view_t row_mapA, - a_lno_view_t entriesA, b_size_view_t row_mapB, - b_lno_view_t entriesB, c_size_view_t row_mapC); + static void spadd_symbolic(const ExecSpace &exec, KernelHandle *handle, + typename KernelHandle::const_nnz_lno_t m, + typename KernelHandle::const_nnz_lno_t n, + a_size_view_t row_mapA, a_lno_view_t entriesA, + b_size_view_t row_mapB, b_lno_view_t entriesB, + c_size_view_t row_mapC); }; #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY -template -struct SPADD_SYMBOLIC +struct SPADD_SYMBOLIC { - static void 
spadd_symbolic(KernelHandle *handle, a_size_view_t row_mapA,
-                             a_lno_view_t entriesA, b_size_view_t row_mapB,
-                             b_lno_view_t entriesB, c_size_view_t row_mapC) {
-    spadd_symbolic_impl(handle, row_mapA, entriesA, row_mapB, entriesB,
+  static void spadd_symbolic(const ExecSpace &exec, KernelHandle *handle,
+                             typename KernelHandle::const_nnz_lno_t /* m */,
+                             typename KernelHandle::const_nnz_lno_t /* n */,
+                             a_size_view_t row_mapA, a_lno_view_t entriesA,
+                             b_size_view_t row_mapB, b_lno_view_t entriesB,
+                             c_size_view_t row_mapC) {
+    spadd_symbolic_impl(exec, handle, row_mapA, entriesA, row_mapB, entriesB,
                         row_mapC);
   }
 };
@@ -111,6 +121,7 @@ struct SPADD_SYMBOLIC, \
@@ -135,6 +146,7 @@ struct SPADD_SYMBOLIC, \
@@ -155,6 +167,6 @@ struct SPADD_SYMBOLIC >, \
       false, true>;
-#include
+#include
 #endif
diff --git a/sparse/impl/KokkosSparse_spiluk_numeric_impl.hpp b/sparse/impl/KokkosSparse_spiluk_numeric_impl.hpp
index c2863885b2..b3b5dfa277 100644
--- a/sparse/impl/KokkosSparse_spiluk_numeric_impl.hpp
+++ b/sparse/impl/KokkosSparse_spiluk_numeric_impl.hpp
@@ -21,8 +21,17 @@
 /// \brief Implementation(s) of the numeric phase of sparse ILU(k).
#include +#include #include #include +#include "KokkosBatched_SetIdentity_Decl.hpp" +#include "KokkosBatched_SetIdentity_Impl.hpp" +#include "KokkosBatched_Trsm_Decl.hpp" +#include "KokkosBatched_Trsm_Serial_Impl.hpp" +#include "KokkosBatched_Axpy.hpp" +#include "KokkosBatched_Gemm_Decl.hpp" +#include "KokkosBatched_Gemm_Serial_Impl.hpp" +#include "KokkosBlas1_set.hpp" //#define NUMERIC_OUTPUT_INFO @@ -30,391 +39,514 @@ namespace KokkosSparse { namespace Impl { namespace Experimental { -// struct UnsortedTag {}; - -template -struct ILUKLvlSchedRPNumericFunctor { - using lno_t = typename AEntriesType::non_const_value_type; - using scalar_t = typename AValuesType::non_const_value_type; - ARowMapType A_row_map; - AEntriesType A_entries; - AValuesType A_values; - LRowMapType L_row_map; - LEntriesType L_entries; - LValuesType L_values; - URowMapType U_row_map; - UEntriesType U_entries; - UValuesType U_values; - LevelViewType level_idx; - WorkViewType iw; - nnz_lno_t lev_start; - - ILUKLvlSchedRPNumericFunctor( - const ARowMapType &A_row_map_, const AEntriesType &A_entries_, - const AValuesType &A_values_, const LRowMapType &L_row_map_, - const LEntriesType &L_entries_, LValuesType &L_values_, - const URowMapType &U_row_map_, const UEntriesType &U_entries_, - UValuesType &U_values_, const LevelViewType &level_idx_, - WorkViewType &iw_, const nnz_lno_t &lev_start_) - : A_row_map(A_row_map_), - A_entries(A_entries_), - A_values(A_values_), - L_row_map(L_row_map_), - L_entries(L_entries_), - L_values(L_values_), - U_row_map(U_row_map_), - U_entries(U_entries_), - U_values(U_values_), - level_idx(level_idx_), - iw(iw_), - lev_start(lev_start_) {} - - KOKKOS_INLINE_FUNCTION - void operator()(const lno_t i) const { - auto rowid = level_idx(i); - auto tid = i - lev_start; - auto k1 = L_row_map(rowid); - auto k2 = L_row_map(rowid + 1); -#ifdef KEEP_DIAG - for (auto k = k1; k < k2 - 1; ++k) { -#else - for (auto k = k1; k < k2; ++k) { -#endif - auto col = L_entries(k); - 
L_values(k) = 0.0; - iw(tid, col) = k; +template +struct IlukWrap { + // + // Useful types + // + using execution_space = typename IlukHandle::execution_space; + using memory_space = typename IlukHandle::memory_space; + using lno_t = typename IlukHandle::nnz_lno_t; + using size_type = typename IlukHandle::size_type; + using scalar_t = typename IlukHandle::nnz_scalar_t; + using HandleDeviceRowMapType = typename IlukHandle::nnz_row_view_t; + using HandleDeviceValueType = typename IlukHandle::nnz_value_view_t; + using WorkViewType = typename IlukHandle::work_view_t; + using LevelHostViewType = typename IlukHandle::nnz_lno_view_host_t; + using LevelViewType = typename IlukHandle::nnz_lno_view_t; + using karith = typename Kokkos::ArithTraits; + using team_policy = typename IlukHandle::TeamPolicy; + using member_type = typename team_policy::member_type; + using range_policy = typename IlukHandle::RangePolicy; + + static team_policy get_team_policy(const size_type nrows, + const int team_size) { + team_policy rv; + if (team_size == -1) { + rv = team_policy(nrows, Kokkos::AUTO); + } else { + rv = team_policy(nrows, team_size); } -#ifdef KEEP_DIAG - L_values(k2 - 1) = scalar_t(1.0); -#endif - k1 = U_row_map(rowid); - k2 = U_row_map(rowid + 1); - for (auto k = k1; k < k2; ++k) { - auto col = U_entries(k); - U_values(k) = 0.0; - iw(tid, col) = k; + return rv; + } + + static team_policy get_team_policy(execution_space exe_space, + const size_type nrows, + const int team_size) { + team_policy rv; + if (team_size == -1) { + rv = team_policy(exe_space, nrows, Kokkos::AUTO); + } else { + rv = team_policy(exe_space, nrows, team_size); } - // Unpack the ith row of A - k1 = A_row_map(rowid); - k2 = A_row_map(rowid + 1); - for (auto k = k1; k < k2; ++k) { - auto col = A_entries(k); - auto ipos = iw(tid, col); - if (col < rowid) - L_values(ipos) = A_values(k); - else - U_values(ipos) = A_values(k); + return rv; + } + + /** + * Common base class for SPILUK functors. 
Default version does not support + * blocks + */ + template + struct Common { + ARowMapType A_row_map; + AEntriesType A_entries; + AValuesType A_values; + LRowMapType L_row_map; + LEntriesType L_entries; + LValuesType L_values; + URowMapType U_row_map; + UEntriesType U_entries; + UValuesType U_values; + LevelViewType level_idx; + WorkViewType iw; + lno_t lev_start; + + using reftype = scalar_t &; + + Common(const ARowMapType &A_row_map_, const AEntriesType &A_entries_, + const AValuesType &A_values_, const LRowMapType &L_row_map_, + const LEntriesType &L_entries_, LValuesType &L_values_, + const URowMapType &U_row_map_, const UEntriesType &U_entries_, + UValuesType &U_values_, const LevelViewType &level_idx_, + WorkViewType &iw_, const lno_t &lev_start_, + const size_type &block_size_) + : A_row_map(A_row_map_), + A_entries(A_entries_), + A_values(A_values_), + L_row_map(L_row_map_), + L_entries(L_entries_), + L_values(L_values_), + U_row_map(U_row_map_), + U_entries(U_entries_), + U_values(U_values_), + level_idx(level_idx_), + iw(iw_), + lev_start(lev_start_) { + KK_REQUIRE_MSG(block_size_ == 0, + "Tried to use blocks with the unblocked Common?"); } - // Eliminate prev rows - k1 = L_row_map(rowid); - k2 = L_row_map(rowid + 1); -#ifdef KEEP_DIAG - for (auto k = k1; k < k2 - 1; ++k) { -#else - for (auto k = k1; k < k2; ++k) { -#endif - auto prev_row = L_entries(k); -#ifdef KEEP_DIAG - auto fact = L_values(k) / U_values(U_row_map(prev_row)); -#else - auto fact = L_values(k) * U_values(U_row_map(prev_row)); -#endif - L_values(k) = fact; - for (auto kk = U_row_map(prev_row) + 1; kk < U_row_map(prev_row + 1); - ++kk) { - auto col = U_entries(kk); - auto ipos = iw(tid, col); - if (ipos == -1) continue; - auto lxu = -U_values(kk) * fact; - if (col < rowid) - L_values(ipos) += lxu; - else - U_values(ipos) += lxu; - } // end for kk - } // end for k - -#ifdef KEEP_DIAG - if (U_values(iw(tid, rowid)) == 0.0) { - U_values(iw(tid, rowid)) = 1e6; + // lset + 
KOKKOS_INLINE_FUNCTION + void lset(const size_type nnz, const scalar_t &value) const { + L_values(nnz) = value; } -#else - if (U_values(iw(tid, rowid)) == 0.0) { - U_values(iw(tid, rowid)) = 1e6; - } else { - U_values(iw(tid, rowid)) = 1.0 / U_values(iw(tid, rowid)); + + // uset + KOKKOS_INLINE_FUNCTION + void uset(const size_type nnz, const scalar_t &value) const { + U_values(nnz) = value; } -#endif - // Reset - k1 = L_row_map(rowid); - k2 = L_row_map(rowid + 1); -#ifdef KEEP_DIAG - for (auto k = k1; k < k2 - 1; ++k) -#else - for (auto k = k1; k < k2; ++k) -#endif - iw(tid, L_entries(k)) = -1; + // lset_id + KOKKOS_INLINE_FUNCTION + void lset_id(const member_type &team, const size_type nnz) const { + // Not sure a Kokkos::single is really needed here since the + // race is harmless + Kokkos::single(Kokkos::PerTeam(team), + [&]() { L_values(nnz) = scalar_t(1.0); }); + } - k1 = U_row_map(rowid); - k2 = U_row_map(rowid + 1); - for (auto k = k1; k < k2; ++k) iw(tid, U_entries(k)) = -1; - } -}; - -template -struct ILUKLvlSchedTP1NumericFunctor { - using execution_space = typename ARowMapType::execution_space; - using policy_type = Kokkos::TeamPolicy; - using member_type = typename policy_type::member_type; - using size_type = typename ARowMapType::non_const_value_type; - using lno_t = typename AEntriesType::non_const_value_type; - using scalar_t = typename AValuesType::non_const_value_type; - - ARowMapType A_row_map; - AEntriesType A_entries; - AValuesType A_values; - LRowMapType L_row_map; - LEntriesType L_entries; - LValuesType L_values; - URowMapType U_row_map; - UEntriesType U_entries; - UValuesType U_values; - LevelViewType level_idx; - WorkViewType iw; - nnz_lno_t lev_start; - - ILUKLvlSchedTP1NumericFunctor( - const ARowMapType &A_row_map_, const AEntriesType &A_entries_, - const AValuesType &A_values_, const LRowMapType &L_row_map_, - const LEntriesType &L_entries_, LValuesType &L_values_, - const URowMapType &U_row_map_, const UEntriesType &U_entries_, - 
UValuesType &U_values_, const LevelViewType &level_idx_, - WorkViewType &iw_, const nnz_lno_t &lev_start_) - : A_row_map(A_row_map_), - A_entries(A_entries_), - A_values(A_values_), - L_row_map(L_row_map_), - L_entries(L_entries_), - L_values(L_values_), - U_row_map(U_row_map_), - U_entries(U_entries_), - U_values(U_values_), - level_idx(level_idx_), - iw(iw_), - lev_start(lev_start_) {} - - KOKKOS_INLINE_FUNCTION - void operator()(const member_type &team) const { - nnz_lno_t my_team = static_cast(team.league_rank()); - nnz_lno_t rowid = - static_cast(level_idx(my_team + lev_start)); // map to rowid - - size_type k1 = static_cast(L_row_map(rowid)); - size_type k2 = static_cast(L_row_map(rowid + 1)); -#ifdef KEEP_DIAG - Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2 - 1), - [&](const size_type k) { - nnz_lno_t col = static_cast(L_entries(k)); - L_values(k) = 0.0; - iw(my_team, col) = static_cast(k); - }); -#else - Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), - [&](const size_type k) { - nnz_lno_t col = static_cast(L_entries(k)); - L_values(k) = 0.0; - iw(my_team, col) = static_cast(k); - }); -#endif + // divide. lhs /= rhs + KOKKOS_INLINE_FUNCTION + void divide(const member_type &team, scalar_t &lhs, + const scalar_t &rhs) const { + Kokkos::single(Kokkos::PerTeam(team), [&]() { lhs /= rhs; }); + team.team_barrier(); + } -#ifdef KEEP_DIAG - // if (my_thread == 0) L_values(k2 - 1) = scalar_t(1.0); - Kokkos::single(Kokkos::PerTeam(team), - [&]() { L_values(k2 - 1) = scalar_t(1.0); }); -#endif + // multiply_subtract. 
C -= A * B + KOKKOS_INLINE_FUNCTION + void multiply_subtract(const scalar_t &A, const scalar_t &B, + scalar_t &C) const { + C -= A * B; + } - team.team_barrier(); - - k1 = static_cast(U_row_map(rowid)); - k2 = static_cast(U_row_map(rowid + 1)); - Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), - [&](const size_type k) { - nnz_lno_t col = static_cast(U_entries(k)); - U_values(k) = 0.0; - iw(my_team, col) = static_cast(k); - }); - - team.team_barrier(); - - // Unpack the ith row of A - k1 = static_cast(A_row_map(rowid)); - k2 = static_cast(A_row_map(rowid + 1)); - Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), - [&](const size_type k) { - nnz_lno_t col = static_cast(A_entries(k)); - nnz_lno_t ipos = iw(my_team, col); - if (col < rowid) - L_values(ipos) = A_values(k); - else - U_values(ipos) = A_values(k); - }); - - team.team_barrier(); - - // Eliminate prev rows - k1 = static_cast(L_row_map(rowid)); - k2 = static_cast(L_row_map(rowid + 1)); -#ifdef KEEP_DIAG - for (size_type k = k1; k < k2 - 1; k++) -#else - for (size_type k = k1; k < k2; k++) -#endif - { - nnz_lno_t prev_row = L_entries(k); - - scalar_t fact = scalar_t(0.0); - Kokkos::single( - Kokkos::PerTeam(team), - [&](scalar_t &tmp_fact) { -#ifdef KEEP_DIAG - tmp_fact = L_values(k) / U_values(U_row_map(prev_row)); -#else - tmp_fact = L_values(k) * U_values(U_row_map(prev_row)); -#endif - L_values(k) = tmp_fact; - }, - fact); - - Kokkos::parallel_for( - Kokkos::TeamThreadRange(team, U_row_map(prev_row) + 1, - U_row_map(prev_row + 1)), - [&](const size_type kk) { - nnz_lno_t col = static_cast(U_entries(kk)); - nnz_lno_t ipos = iw(my_team, col); - auto lxu = -U_values(kk) * fact; - if (ipos != -1) { - if (col < rowid) - L_values(ipos) += lxu; - else - U_values(ipos) += lxu; - } - }); // end for kk + // lget + KOKKOS_INLINE_FUNCTION + scalar_t &lget(const size_type nnz) const { return L_values(nnz); } - team.team_barrier(); - } // end for k - - // if (my_thread == 0) { - 
Kokkos::single(Kokkos::PerTeam(team), [&]() { - nnz_lno_t ipos = iw(my_team, rowid); -#ifdef KEEP_DIAG - if (U_values(ipos) == 0.0) { - U_values(ipos) = 1e6; + // uget + KOKKOS_INLINE_FUNCTION + scalar_t &uget(const size_type nnz) const { return U_values(nnz); } + + // aget + KOKKOS_INLINE_FUNCTION + scalar_t aget(const size_type nnz) const { return A_values(nnz); } + + // uequal + KOKKOS_INLINE_FUNCTION + bool uequal(const size_type nnz, const scalar_t &value) const { + return U_values(nnz) == value; + } + + // print + KOKKOS_INLINE_FUNCTION + void print(const scalar_t &item) const { std::cout << item << std::endl; } + }; + + // Partial specialization for block support + template + struct Common { + ARowMapType A_row_map; + AEntriesType A_entries; + AValuesType A_values; + LRowMapType L_row_map; + LEntriesType L_entries; + LValuesType L_values; + URowMapType U_row_map; + UEntriesType U_entries; + UValuesType U_values; + LevelViewType level_idx; + WorkViewType iw; + lno_t lev_start; + size_type block_size; + size_type block_items; + + // BSR data is in LayoutRight! 
+ using Layout = Kokkos::LayoutRight; + + using LBlock = Kokkos::View< + typename LValuesType::value_type **, Layout, + typename LValuesType::device_type, + Kokkos::MemoryTraits >; + + using UBlock = Kokkos::View< + typename UValuesType::value_type **, Layout, + typename UValuesType::device_type, + Kokkos::MemoryTraits >; + + using ABlock = Kokkos::View< + typename AValuesType::value_type **, Layout, + typename AValuesType::device_type, + Kokkos::MemoryTraits >; + + using reftype = LBlock; + + Common(const ARowMapType &A_row_map_, const AEntriesType &A_entries_, + const AValuesType &A_values_, const LRowMapType &L_row_map_, + const LEntriesType &L_entries_, LValuesType &L_values_, + const URowMapType &U_row_map_, const UEntriesType &U_entries_, + UValuesType &U_values_, const LevelViewType &level_idx_, + WorkViewType &iw_, const lno_t &lev_start_, + const size_type &block_size_) + : A_row_map(A_row_map_), + A_entries(A_entries_), + A_values(A_values_), + L_row_map(L_row_map_), + L_entries(L_entries_), + L_values(L_values_), + U_row_map(U_row_map_), + U_entries(U_entries_), + U_values(U_values_), + level_idx(level_idx_), + iw(iw_), + lev_start(lev_start_), + block_size(block_size_), + block_items(block_size * block_size) { + KK_REQUIRE_MSG(block_size > 0, + "Tried to use block_size=0 with the blocked Common?"); + } + + // lset + KOKKOS_INLINE_FUNCTION + void lset(const size_type block, const scalar_t &value) const { + KokkosBlas::SerialSet::invoke(value, lget(block)); + } + + KOKKOS_INLINE_FUNCTION + void lset(const size_type block, const ABlock &rhs) const { + auto lblock = lget(block); + for (size_type i = 0; i < block_size; ++i) { + for (size_type j = 0; j < block_size; ++j) { + lblock(i, j) = rhs(i, j); + } } -#else - if (U_values(ipos) == 0.0) { - U_values(ipos) = 1e6; - } else { - U_values(ipos) = 1.0 / U_values(ipos); + } + + // uset + KOKKOS_INLINE_FUNCTION + void uset(const size_type block, const scalar_t &value) const { + 
KokkosBlas::SerialSet::invoke(value, uget(block)); + } + + KOKKOS_INLINE_FUNCTION + void uset(const size_type block, const ABlock &rhs) const { + auto ublock = uget(block); + for (size_type i = 0; i < block_size; ++i) { + for (size_type j = 0; j < block_size; ++j) { + ublock(i, j) = rhs(i, j); + } } -#endif - }); - //} - - team.team_barrier(); - - // Reset - k1 = static_cast(L_row_map(rowid)); - k2 = static_cast(L_row_map(rowid + 1)); -#ifdef KEEP_DIAG - Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2 - 1), - [&](const size_type k) { - nnz_lno_t col = static_cast(L_entries(k)); - iw(my_team, col) = -1; - }); -#else - Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), - [&](const size_type k) { - nnz_lno_t col = static_cast(L_entries(k)); - iw(my_team, col) = -1; - }); -#endif + } + + // lset_id + KOKKOS_INLINE_FUNCTION + void lset_id(const member_type &team, const size_type block) const { + KokkosBatched::TeamSetIdentity::invoke(team, lget(block)); + } + + // divide. lhs /= rhs + KOKKOS_INLINE_FUNCTION + void divide(const member_type &team, const LBlock &lhs, + const UBlock &rhs) const { + KokkosBatched::TeamTrsm< + member_type, KokkosBatched::Side::Right, KokkosBatched::Uplo::Upper, + KokkosBatched::Trans::NoTranspose, // not 100% on this + KokkosBatched::Diag::NonUnit, + KokkosBatched::Algo::Trsm::Unblocked>:: // not 100% on this + invoke(team, 1.0, rhs, lhs); + } + + // multiply_subtract. C -= A * B + template + KOKKOS_INLINE_FUNCTION void multiply_subtract(const UBlock &A, + const LBlock &B, + CView &C) const { + // Use gemm. 
alpha is hardcoded to -1, beta hardcoded to 1 + KokkosBatched::SerialGemm< + KokkosBatched::Trans::NoTranspose, KokkosBatched::Trans::NoTranspose, + KokkosBatched::Algo::Gemm::Unblocked>::invoke( + -1.0, A, B, 1.0, C); + } + + // lget + KOKKOS_INLINE_FUNCTION + LBlock lget(const size_type block) const { + return LBlock(L_values.data() + (block * block_items), block_size, + block_size); + } + + // uget + KOKKOS_INLINE_FUNCTION + UBlock uget(const size_type block) const { + return UBlock(U_values.data() + (block * block_items), block_size, + block_size); + } + + // aget + KOKKOS_INLINE_FUNCTION + ABlock aget(const size_type block) const { + return ABlock(A_values.data() + (block * block_items), block_size, + block_size); + } + + // uequal + KOKKOS_INLINE_FUNCTION + bool uequal(const size_type block, const scalar_t &value) const { + auto u_block = uget(block); + for (size_type i = 0; i < block_size; ++i) { + for (size_type j = 0; j < block_size; ++j) { + if (u_block(i, j) != value) { + return false; + } + } + } + return true; + } - k1 = static_cast(U_row_map(rowid)); - k2 = static_cast(U_row_map(rowid + 1)); - Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), - [&](const size_type k) { - nnz_lno_t col = static_cast(U_entries(k)); - iw(my_team, col) = -1; - }); + // print + KOKKOS_INLINE_FUNCTION + void print(const LBlock &item) const { + for (size_type i = 0; i < block_size; ++i) { + std::cout << " "; + for (size_type j = 0; j < block_size; ++j) { + std::cout << item(i, j) << " "; + } + std::cout << std::endl; + } + } + }; + + template + struct ILUKLvlSchedTP1NumericFunctor + : public Common { + using Base = Common; + + ILUKLvlSchedTP1NumericFunctor( + const ARowMapType &A_row_map_, const AEntriesType &A_entries_, + const AValuesType &A_values_, const LRowMapType &L_row_map_, + const LEntriesType &L_entries_, LValuesType &L_values_, + const URowMapType &U_row_map_, const UEntriesType &U_entries_, + UValuesType &U_values_, const LevelViewType &level_idx_, + 
WorkViewType &iw_, const lno_t &lev_start_, + const size_type &block_size_ = 0) + : Base(A_row_map_, A_entries_, A_values_, L_row_map_, L_entries_, + L_values_, U_row_map_, U_entries_, U_values_, level_idx_, iw_, + lev_start_, block_size_) {} + + KOKKOS_INLINE_FUNCTION + void operator()(const member_type &team) const { + const auto my_team = team.league_rank(); + const auto rowid = + Base::level_idx(my_team + Base::lev_start); // map to rowid + + // Set active entries in L to zero, store active cols in iw + // Set L diagonal for this row to identity + size_type k1 = Base::L_row_map(rowid); + size_type k2 = Base::L_row_map(rowid + 1) - 1; + Base::lset_id(team, k2); + Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), + [&](const size_type k) { + const auto col = Base::L_entries(k); + Base::lset(k, 0.0); + Base::iw(my_team, col) = k; + }); + + team.team_barrier(); + + // Set active entries in U to zero, store active cols in iw + k1 = Base::U_row_map(rowid); + k2 = Base::U_row_map(rowid + 1); + Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), + [&](const size_type k) { + const auto col = Base::U_entries(k); + Base::uset(k, 0.0); + Base::iw(my_team, col) = k; + }); + + team.team_barrier(); + + // Unpack the rowid-th row of A, copy into L,U + k1 = Base::A_row_map(rowid); + k2 = Base::A_row_map(rowid + 1); + Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), + [&](const size_type k) { + const auto col = Base::A_entries(k); + const auto ipos = Base::iw(my_team, col); + if (col < rowid) { + Base::lset(ipos, Base::aget(k)); + } else { + Base::uset(ipos, Base::aget(k)); + } + }); + + team.team_barrier(); + + // Eliminate prev rows + k1 = Base::L_row_map(rowid); + k2 = Base::L_row_map(rowid + 1) - 1; + for (auto k = k1; k < k2; k++) { + const auto prev_row = Base::L_entries(k); + const auto udiag = Base::uget(Base::U_row_map(prev_row)); + Base::divide(team, Base::lget(k), udiag); + auto fact = Base::lget(k); + Kokkos::parallel_for( + 
Kokkos::TeamThreadRange(team, Base::U_row_map(prev_row) + 1, + Base::U_row_map(prev_row + 1)), + [&](const size_type kk) { + const auto col = Base::U_entries(kk); + const auto ipos = Base::iw(my_team, col); + if (ipos != -1) { + typename Base::reftype C = + col < rowid ? Base::lget(ipos) : Base::uget(ipos); + Base::multiply_subtract(fact, Base::uget(kk), C); + } + }); // end for kk + + team.team_barrier(); + } // end for k + + Kokkos::single(Kokkos::PerTeam(team), [&]() { + const auto ipos = Base::iw(my_team, rowid); + if (Base::uequal(ipos, 0.0)) { + Base::uset(ipos, 1e6); + } + }); + + team.team_barrier(); + + // Reset + k1 = Base::L_row_map(rowid); + k2 = Base::L_row_map(rowid + 1) - 1; + Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), + [&](const size_type k) { + const auto col = Base::L_entries(k); + Base::iw(my_team, col) = -1; + }); + + k1 = Base::U_row_map(rowid); + k2 = Base::U_row_map(rowid + 1); + Kokkos::parallel_for(Kokkos::TeamThreadRange(team, k1, k2), + [&](const size_type k) { + const auto col = Base::U_entries(k); + Base::iw(my_team, col) = -1; + }); + } + }; + +#define FunctorTypeMacro(Functor, BlockEnabled) \ + Functor + +#define KernelLaunchMacro(arow, aent, aval, lrow, lent, lval, urow, uent, \ + uval, polc, name, lidx, iwv, lstrt, ftf, ftb, be, \ + bs) \ + if (be) { \ + ftb functor(arow, aent, aval, lrow, lent, lval, urow, uent, uval, lidx, \ + iwv, lstrt, bs); \ + Kokkos::parallel_for(name, polc, functor); \ + } else { \ + ftf functor(arow, aent, aval, lrow, lent, lval, urow, uent, uval, lidx, \ + iwv, lstrt); \ + Kokkos::parallel_for(name, polc, functor); \ } -}; - -template -void iluk_numeric(IlukHandle &thandle, const ARowMapType &A_row_map, - const AEntriesType &A_entries, const AValuesType &A_values, - const LRowMapType &L_row_map, const LEntriesType &L_entries, - LValuesType &L_values, const URowMapType &U_row_map, - const UEntriesType &U_entries, UValuesType &U_values) { - using execution_space = typename 
IlukHandle::execution_space; - using size_type = typename IlukHandle::size_type; - using nnz_lno_t = typename IlukHandle::nnz_lno_t; - using HandleDeviceEntriesType = typename IlukHandle::nnz_lno_view_t; - using WorkViewType = typename IlukHandle::work_view_t; - using LevelHostViewType = typename IlukHandle::nnz_lno_view_host_t; - - size_type nlevels = thandle.get_num_levels(); - int team_size = thandle.get_team_size(); - - LevelHostViewType level_ptr_h = thandle.get_host_level_ptr(); - HandleDeviceEntriesType level_idx = thandle.get_level_idx(); - - LevelHostViewType level_nchunks_h, level_nrowsperchunk_h; - WorkViewType iw; - - //{ - if (thandle.get_algorithm() == - KokkosSparse::Experimental::SPILUKAlgorithm::SEQLVLSCHD_TP1) { + + template + static void iluk_numeric(IlukHandle &thandle, const ARowMapType &A_row_map, + const AEntriesType &A_entries, + const AValuesType &A_values, + const LRowMapType &L_row_map, + const LEntriesType &L_entries, LValuesType &L_values, + const URowMapType &U_row_map, + const UEntriesType &U_entries, + UValuesType &U_values) { + using TPF = FunctorTypeMacro(ILUKLvlSchedTP1NumericFunctor, false); + using TPB = FunctorTypeMacro(ILUKLvlSchedTP1NumericFunctor, true); + + size_type nlevels = thandle.get_num_levels(); + int team_size = thandle.get_team_size(); + const auto block_size = thandle.get_block_size(); + const auto block_enabled = thandle.is_block_enabled(); + + LevelHostViewType level_ptr_h = thandle.get_host_level_ptr(); + LevelViewType level_idx = thandle.get_level_idx(); + + LevelHostViewType level_nchunks_h, level_nrowsperchunk_h; + WorkViewType iw; + level_nchunks_h = thandle.get_level_nchunks(); level_nrowsperchunk_h = thandle.get_level_nrowsperchunk(); - } - iw = thandle.get_iw(); + iw = thandle.get_iw(); - // Main loop must be performed sequential. 
Question: Try out Cuda's graph - // stuff to reduce kernel launch overhead - for (size_type lvl = 0; lvl < nlevels; ++lvl) { - nnz_lno_t lev_start = level_ptr_h(lvl); - nnz_lno_t lev_end = level_ptr_h(lvl + 1); + // Main loop must be performed sequential. Question: Try out Cuda's graph + // stuff to reduce kernel launch overhead + for (size_type lvl = 0; lvl < nlevels; ++lvl) { + lno_t lev_start = level_ptr_h(lvl); + lno_t lev_end = level_ptr_h(lvl + 1); - if ((lev_end - lev_start) != 0) { - if (thandle.get_algorithm() == - KokkosSparse::Experimental::SPILUKAlgorithm::SEQLVLSCHD_RP) { - Kokkos::parallel_for( - "parfor_fixed_lvl", - Kokkos::RangePolicy(lev_start, lev_end), - ILUKLvlSchedRPNumericFunctor< - ARowMapType, AEntriesType, AValuesType, LRowMapType, - LEntriesType, LValuesType, URowMapType, UEntriesType, - UValuesType, HandleDeviceEntriesType, WorkViewType, nnz_lno_t>( - A_row_map, A_entries, A_values, L_row_map, L_entries, L_values, - U_row_map, U_entries, U_values, level_idx, iw, lev_start)); - } else if (thandle.get_algorithm() == - KokkosSparse::Experimental::SPILUKAlgorithm::SEQLVLSCHD_TP1) { - using policy_type = Kokkos::TeamPolicy; - - nnz_lno_t lvl_rowid_start = 0; - nnz_lno_t lvl_nrows_chunk; + if ((lev_end - lev_start) != 0) { + lno_t lvl_rowid_start = 0; + lno_t lvl_nrows_chunk; for (int chunkid = 0; chunkid < level_nchunks_h(lvl); chunkid++) { if ((lvl_rowid_start + level_nrowsperchunk_h(lvl)) > (lev_end - lev_start)) @@ -422,163 +554,110 @@ void iluk_numeric(IlukHandle &thandle, const ARowMapType &A_row_map, else lvl_nrows_chunk = level_nrowsperchunk_h(lvl); - ILUKLvlSchedTP1NumericFunctor< - ARowMapType, AEntriesType, AValuesType, LRowMapType, LEntriesType, - LValuesType, URowMapType, UEntriesType, UValuesType, - HandleDeviceEntriesType, WorkViewType, nnz_lno_t> - tstf(A_row_map, A_entries, A_values, L_row_map, L_entries, - L_values, U_row_map, U_entries, U_values, level_idx, iw, - lev_start + lvl_rowid_start); - - if (team_size == -1) - 
Kokkos::parallel_for( - "parfor_tp1", policy_type(lvl_nrows_chunk, Kokkos::AUTO), tstf); - else - Kokkos::parallel_for("parfor_tp1", - policy_type(lvl_nrows_chunk, team_size), tstf); + team_policy tpolicy = get_team_policy(lvl_nrows_chunk, team_size); + KernelLaunchMacro(A_row_map, A_entries, A_values, L_row_map, + L_entries, L_values, U_row_map, U_entries, U_values, + tpolicy, "parfor_tp1", level_idx, iw, + lev_start + lvl_rowid_start, TPF, TPB, + block_enabled, block_size); Kokkos::fence(); lvl_rowid_start += lvl_nrows_chunk; } - } - } // end if - } // end for lvl - //} + } // end if + } // end for lvl // Output check #ifdef NUMERIC_OUTPUT_INFO - std::cout << " iluk_numeric result: " << std::endl; + std::cout << " iluk_numeric result: " << std::endl; - std::cout << " nnzL: " << thandle.get_nnzL() << std::endl; - std::cout << " L_row_map = "; - for (size_type i = 0; i < thandle.get_nrows() + 1; ++i) { - std::cout << L_row_map(i) << " "; - } - std::cout << std::endl; + std::cout << " nnzL: " << thandle.get_nnzL() << std::endl; + std::cout << " L_row_map = "; + for (size_type i = 0; i < thandle.get_nrows() + 1; ++i) { + std::cout << L_row_map(i) << " "; + } + std::cout << std::endl; - std::cout << " L_entries = "; - for (size_type i = 0; i < thandle.get_nnzL(); ++i) { - std::cout << L_entries(i) << " "; - } - std::cout << std::endl; + std::cout << " L_entries = "; + for (size_type i = 0; i < thandle.get_nnzL(); ++i) { + std::cout << L_entries(i) << " "; + } + std::cout << std::endl; - std::cout << " L_values = "; - for (size_type i = 0; i < thandle.get_nnzL(); ++i) { - std::cout << L_values(i) << " "; - } - std::cout << std::endl; + std::cout << " L_values = "; + for (size_type i = 0; i < thandle.get_nnzL(); ++i) { + std::cout << L_values(i) << " "; + } + std::cout << std::endl; - std::cout << " nnzU: " << thandle.get_nnzU() << std::endl; - std::cout << " U_row_map = "; - for (size_type i = 0; i < thandle.get_nrows() + 1; ++i) { - std::cout << U_row_map(i) << " "; - 
} - std::cout << std::endl; + std::cout << " nnzU: " << thandle.get_nnzU() << std::endl; + std::cout << " U_row_map = "; + for (size_type i = 0; i < thandle.get_nrows() + 1; ++i) { + std::cout << U_row_map(i) << " "; + } + std::cout << std::endl; - std::cout << " U_entries = "; - for (size_type i = 0; i < thandle.get_nnzU(); ++i) { - std::cout << U_entries(i) << " "; - } - std::cout << std::endl; + std::cout << " U_entries = "; + for (size_type i = 0; i < thandle.get_nnzU(); ++i) { + std::cout << U_entries(i) << " "; + } + std::cout << std::endl; - std::cout << " U_values = "; - for (size_type i = 0; i < thandle.get_nnzU(); ++i) { - std::cout << U_values(i) << " "; - } - std::cout << std::endl; + std::cout << " U_values = "; + for (size_type i = 0; i < thandle.get_nnzU(); ++i) { + std::cout << U_values(i) << " "; + } + std::cout << std::endl; #endif -} // end iluk_numeric - -template -void iluk_numeric_streams(const std::vector &execspace_v, - const std::vector &thandle_v, - const std::vector &A_row_map_v, - const std::vector &A_entries_v, - const std::vector &A_values_v, - const std::vector &L_row_map_v, - const std::vector &L_entries_v, - std::vector &L_values_v, - const std::vector &U_row_map_v, - const std::vector &U_entries_v, - std::vector &U_values_v) { - using size_type = typename IlukHandle::size_type; - using nnz_lno_t = typename IlukHandle::nnz_lno_t; - using HandleDeviceEntriesType = typename IlukHandle::nnz_lno_view_t; - using WorkViewType = typename IlukHandle::work_view_t; - using LevelHostViewType = typename IlukHandle::nnz_lno_view_host_t; - - // Create vectors for handles' data in streams - int nstreams = execspace_v.size(); - std::vector nlevels_v(nstreams); - std::vector lvl_ptr_h_v(nstreams); - std::vector lvl_idx_v(nstreams); // device views - std::vector lvl_start_v(nstreams); - std::vector lvl_end_v(nstreams); - std::vector iw_v(nstreams); // device views - std::vector stream_have_level_v(nstreams); - - // Retrieve data from handles and find 
max. number of levels among streams - size_type nlevels_max = 0; - for (int i = 0; i < nstreams; i++) { - nlevels_v[i] = thandle_v[i]->get_num_levels(); - lvl_ptr_h_v[i] = thandle_v[i]->get_host_level_ptr(); - lvl_idx_v[i] = thandle_v[i]->get_level_idx(); - iw_v[i] = thandle_v[i]->get_iw(); - stream_have_level_v[i] = true; - if (nlevels_max < nlevels_v[i]) nlevels_max = nlevels_v[i]; - } - - // Assume all streams use the same algorithm - if (thandle_v[0]->get_algorithm() == - KokkosSparse::Experimental::SPILUKAlgorithm::SEQLVLSCHD_RP) { - // Main loop must be performed sequential - for (size_type lvl = 0; lvl < nlevels_max; lvl++) { - // Initial work across streams at each level - for (int i = 0; i < nstreams; i++) { - // Only do this if this stream has this level - if (lvl < nlevels_v[i]) { - lvl_start_v[i] = lvl_ptr_h_v[i](lvl); - lvl_end_v[i] = lvl_ptr_h_v[i](lvl + 1); - if ((lvl_end_v[i] - lvl_start_v[i]) != 0) - stream_have_level_v[i] = true; - else - stream_have_level_v[i] = false; - } else - stream_have_level_v[i] = false; - } - - // Main work of the level across streams - // 1. 
Launch work on all streams - for (int i = 0; i < nstreams; i++) { - // Launch only if stream i-th has this level - if (stream_have_level_v[i]) { - ILUKLvlSchedRPNumericFunctor< - ARowMapType, AEntriesType, AValuesType, LRowMapType, LEntriesType, - LValuesType, URowMapType, UEntriesType, UValuesType, - HandleDeviceEntriesType, WorkViewType, nnz_lno_t> - tstf(A_row_map_v[i], A_entries_v[i], A_values_v[i], - L_row_map_v[i], L_entries_v[i], L_values_v[i], - U_row_map_v[i], U_entries_v[i], U_values_v[i], lvl_idx_v[i], - iw_v[i], lvl_start_v[i]); - Kokkos::parallel_for( - "parfor_rp", - Kokkos::RangePolicy(execspace_v[i], - lvl_start_v[i], lvl_end_v[i]), - tstf); - } // end if (stream_have_level_v[i]) - } // end for streams - } // end for lvl - } // end SEQLVLSCHD_RP - else if (thandle_v[0]->get_algorithm() == - KokkosSparse::Experimental::SPILUKAlgorithm::SEQLVLSCHD_TP1) { - using policy_type = Kokkos::TeamPolicy; + } // end iluk_numeric + + template + static void iluk_numeric_streams( + const std::vector &execspace_v, + const std::vector &thandle_v, + const std::vector &A_row_map_v, + const std::vector &A_entries_v, + const std::vector &A_values_v, + const std::vector &L_row_map_v, + const std::vector &L_entries_v, + std::vector &L_values_v, + const std::vector &U_row_map_v, + const std::vector &U_entries_v, + std::vector &U_values_v) { + using TPF = FunctorTypeMacro(ILUKLvlSchedTP1NumericFunctor, false); + using TPB = FunctorTypeMacro(ILUKLvlSchedTP1NumericFunctor, true); + + // Create vectors for handles' data in streams + int nstreams = execspace_v.size(); + std::vector nlevels_v(nstreams); + std::vector lvl_ptr_h_v(nstreams); + std::vector lvl_idx_v(nstreams); // device views + std::vector lvl_start_v(nstreams); + std::vector lvl_end_v(nstreams); + std::vector iw_v(nstreams); // device views + std::vector stream_have_level_v(nstreams); + std::vector is_block_enabled_v(nstreams); + std::vector block_size_v(nstreams); + + // Retrieve data from handles and find max. 
number of levels among streams + size_type nlevels_max = 0; + for (int i = 0; i < nstreams; i++) { + nlevels_v[i] = thandle_v[i]->get_num_levels(); + lvl_ptr_h_v[i] = thandle_v[i]->get_host_level_ptr(); + lvl_idx_v[i] = thandle_v[i]->get_level_idx(); + iw_v[i] = thandle_v[i]->get_iw(); + is_block_enabled_v[i] = thandle_v[i]->is_block_enabled(); + block_size_v[i] = thandle_v[i]->get_block_size(); + stream_have_level_v[i] = true; + if (nlevels_max < nlevels_v[i]) nlevels_max = nlevels_v[i]; + } std::vector lvl_nchunks_h_v(nstreams); std::vector lvl_nrowsperchunk_h_v(nstreams); - std::vector lvl_rowid_start_v(nstreams); + std::vector lvl_rowid_start_v(nstreams); std::vector team_size_v(nstreams); for (int i = 0; i < nstreams; i++) { @@ -590,7 +669,7 @@ void iluk_numeric_streams(const std::vector &execspace_v, // Main loop must be performed sequential for (size_type lvl = 0; lvl < nlevels_max; lvl++) { // Initial work across streams at each level - nnz_lno_t lvl_nchunks_max = 0; + lno_t lvl_nchunks_max = 0; for (int i = 0; i < nstreams; i++) { // Only do this if this stream has this level if (lvl < nlevels_v[i]) { @@ -616,7 +695,7 @@ void iluk_numeric_streams(const std::vector &execspace_v, // Launch only if stream i-th has this chunk if (chunkid < lvl_nchunks_h_v[i](lvl)) { // 1.a. Specify number of rows (i.e. number of teams) to launch - nnz_lno_t lvl_nrows_chunk = 0; + lno_t lvl_nrows_chunk = 0; if ((lvl_rowid_start_v[i] + lvl_nrowsperchunk_h_v[i](lvl)) > (lvl_end_v[i] - lvl_start_v[i])) lvl_nrows_chunk = @@ -625,27 +704,14 @@ void iluk_numeric_streams(const std::vector &execspace_v, lvl_nrows_chunk = lvl_nrowsperchunk_h_v[i](lvl); // 1.b. 
Create functor for stream i-th and launch - ILUKLvlSchedTP1NumericFunctor< - ARowMapType, AEntriesType, AValuesType, LRowMapType, - LEntriesType, LValuesType, URowMapType, UEntriesType, - UValuesType, HandleDeviceEntriesType, WorkViewType, nnz_lno_t> - tstf(A_row_map_v[i], A_entries_v[i], A_values_v[i], - L_row_map_v[i], L_entries_v[i], L_values_v[i], - U_row_map_v[i], U_entries_v[i], U_values_v[i], - lvl_idx_v[i], iw_v[i], - lvl_start_v[i] + lvl_rowid_start_v[i]); - if (team_size_v[i] == -1) - Kokkos::parallel_for( - "parfor_tp1", - policy_type(execspace_v[i], lvl_nrows_chunk, Kokkos::AUTO), - tstf); - else - Kokkos::parallel_for( - "parfor_tp1", - policy_type(execspace_v[i], lvl_nrows_chunk, - team_size_v[i]), - tstf); - + team_policy tpolicy = get_team_policy( + execspace_v[i], lvl_nrows_chunk, team_size_v[i]); + KernelLaunchMacro(A_row_map_v[i], A_entries_v[i], A_values_v[i], + L_row_map_v[i], L_entries_v[i], L_values_v[i], + U_row_map_v[i], U_entries_v[i], U_values_v[i], + tpolicy, "parfor_tp1", lvl_idx_v[i], iw_v[i], + lvl_start_v[i] + lvl_rowid_start_v[i], TPF, TPB, + is_block_enabled_v[i], block_size_v[i]); // 1.c. 
Ready to move to next chunk lvl_rowid_start_v[i] += lvl_nrows_chunk; } // end if (chunkid < lvl_nchunks_h_v[i](lvl)) @@ -653,12 +719,15 @@ void iluk_numeric_streams(const std::vector &execspace_v, } // end for streams } // end for chunkid } // end for lvl - } // end SEQLVLSCHD_TP1 + } // end iluk_numeric_streams -} // end iluk_numeric_streams +}; // IlukWrap } // namespace Experimental } // namespace Impl } // namespace KokkosSparse +#undef FunctorTypeMacro +#undef KernelLaunchMacro + #endif diff --git a/sparse/impl/KokkosSparse_spiluk_numeric_spec.hpp b/sparse/impl/KokkosSparse_spiluk_numeric_spec.hpp index 12f8c43caf..f58f691e89 100644 --- a/sparse/impl/KokkosSparse_spiluk_numeric_spec.hpp +++ b/sparse/impl/KokkosSparse_spiluk_numeric_spec.hpp @@ -145,6 +145,8 @@ struct SPILUK_NUMERIC { + using Iluk = Experimental::IlukWrap; + static void spiluk_numeric( KernelHandle *handle, const typename KernelHandle::const_nnz_lno_t & /*fill_lev*/, @@ -155,9 +157,9 @@ struct SPILUK_NUMERICget_spiluk_handle(); - Experimental::iluk_numeric(*spiluk_handle, A_row_map, A_entries, A_values, - L_row_map, L_entries, L_values, U_row_map, - U_entries, U_values); + Iluk::iluk_numeric(*spiluk_handle, A_row_map, A_entries, A_values, + L_row_map, L_entries, L_values, U_row_map, U_entries, + U_values); } static void spiluk_numeric_streams( @@ -178,10 +180,10 @@ struct SPILUK_NUMERIC static_cast(L_entries_d.extent(0))) { -#else - if (cntL + lenl > static_cast(L_entries_d.extent(0))) { -#endif // size_type newsize = (size_type) (L_entries_d.extent(0)*EXPAND_FACT); // Kokkos::resize(L_entries, newsize); // Kokkos::resize(L_entries_d, newsize); std::ostringstream os; os << "KokkosSparse::Experimental::spiluk_symbolic: L_entries's extent " "must be larger than " - << L_entries_d.extent(0); + << L_entries_d.extent(0) << ", must be at least " << cntL + lenl + 1; KokkosKernels::Impl::throw_runtime_exception(os.str()); } for (size_type k = 0; k < lenl; ++k) { L_entries(cntL) = h_iL(k); cntL++; } 
-#ifdef KEEP_DIAG // L diag entry L_entries(cntL) = i; cntL++; -#endif L_row_map(i + 1) = cntL; } // End main loop i diff --git a/sparse/impl/KokkosSparse_spmv_bsrmatrix_impl.hpp b/sparse/impl/KokkosSparse_spmv_bsrmatrix_impl.hpp index 06fe6f094d..85e27f1b1b 100644 --- a/sparse/impl/KokkosSparse_spmv_bsrmatrix_impl.hpp +++ b/sparse/impl/KokkosSparse_spmv_bsrmatrix_impl.hpp @@ -27,7 +27,6 @@ #include namespace KokkosSparse { -namespace Experimental { namespace Impl { struct BsrMatrixSpMVTensorCoreFunctorParams { @@ -519,7 +518,6 @@ struct BsrMatrixSpMVTensorCoreDispatcher { }; } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse #endif // #if CUDA && (VOLTA || AMPERE) @@ -537,7 +535,6 @@ struct BsrMatrixSpMVTensorCoreDispatcher { #include "KokkosKernels_ExecSpaceUtils.hpp" namespace KokkosSparse { -namespace Experimental { namespace Impl { namespace Bsr { @@ -677,13 +674,12 @@ struct BSR_GEMV_Functor { // spMatVec_no_transpose: version for CPU execution spaces // (RangePolicy or trivial serial impl used) // -template ()>::type * = nullptr> void spMatVec_no_transpose( - const typename AD::execution_space &exec, - const KokkosKernels::Experimental::Controls &controls, + const typename AD::execution_space &exec, Handle *handle, const AlphaType &alpha, const KokkosSparse::Experimental::BsrMatrix< AT, AO, AD, Kokkos::MemoryTraits, AS> &A, @@ -704,15 +700,8 @@ void spMatVec_no_transpose( AT, AO, AD, Kokkos::MemoryTraits, AS> AMatrix_Internal; - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; 
BSR_GEMV_Functor func( alpha, A, x, beta, y, A.blockDim(), useConjugate); @@ -738,13 +727,12 @@ void spMatVec_no_transpose( // // spMatVec_no_transpose: version for GPU execution spaces (TeamPolicy used) // -template ()>::type * = nullptr> void spMatVec_no_transpose( - const typename AD::execution_space &exec, - const KokkosKernels::Experimental::Controls &controls, + const typename AD::execution_space &exec, Handle *handle, const AlphaType &alpha, const KokkosSparse::Experimental::BsrMatrix< AT, AO, AD, Kokkos::MemoryTraits, AS> &A, @@ -758,15 +746,9 @@ void spMatVec_no_transpose( AMatrix_Internal; typedef typename AMatrix_Internal::execution_space execution_space; - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; + int team_size = -1; int vector_length = -1; const auto block_dim = A.blockDim(); @@ -788,14 +770,10 @@ void spMatVec_no_transpose( int64_t worksets = A.numRows(); // - // Use the controls to allow the user to pass in some tuning parameters. + // Use the handle to allow the user to pass in some tuning parameters. 
// - if (controls.isParameter("team size")) { - team_size = std::stoi(controls.getParameter("team size")); - } - if (controls.isParameter("vector length")) { - vector_length = std::stoi(controls.getParameter("vector length")); - } + if (handle->team_size != -1) team_size = handle->team_size; + if (handle->vector_length != -1) vector_length = handle->vector_length; BSR_GEMV_Functor func( alpha, A, x, beta, y, block_dim, useConjugate); @@ -990,13 +968,12 @@ struct BSR_GEMV_Transpose_Functor { /// \brief spMatVec_transpose: version for CPU execution spaces (RangePolicy or /// trivial serial impl used) -template ()>::type * = nullptr> void spMatVec_transpose( - const typename AD::execution_space &exec, - const KokkosKernels::Experimental::Controls &controls, + const typename AD::execution_space &exec, Handle *handle, const AlphaType &alpha, const KokkosSparse::Experimental::BsrMatrix< AT, AO, AD, Kokkos::MemoryTraits, AS> &A, @@ -1019,15 +996,8 @@ void spMatVec_transpose( AT, AO, AD, Kokkos::MemoryTraits, AS> AMatrix_Internal; - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; BSR_GEMV_Transpose_Functor func( alpha, A, x, y, useConjugate); @@ -1051,15 +1021,14 @@ void spMatVec_transpose( // // spMatVec_transpose: version for GPU execution spaces (TeamPolicy used) // -template ()>::type * = nullptr> void spMatVec_transpose(const typename AMatrix::execution_space &exec, - const KokkosKernels::Experimental::Controls &controls, - const AlphaType &alpha, const AMatrix &A, - const XVector &x, const BetaType &beta, YVector &y, - 
bool useConjugate) { + Handle *handle, const AlphaType &alpha, + const AMatrix &A, const XVector &x, + const BetaType &beta, YVector &y, bool useConjugate) { if (A.numRows() <= 0) { return; } @@ -1073,17 +1042,10 @@ void spMatVec_transpose(const typename AMatrix::execution_space &exec, else if (beta != Kokkos::ArithTraits::one()) KokkosBlas::scal(exec, y, beta, y); - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } - int team_size = -1; - int vector_length = -1; + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; + int team_size = -1; + int vector_length = -1; int64_t worksets = A.numRows(); @@ -1104,14 +1066,10 @@ void spMatVec_transpose(const typename AMatrix::execution_space &exec, } // - // Use the controls to allow the user to pass in some tuning parameters. + // Use the handle to allow the user to pass in some tuning parameters. 
// - if (controls.isParameter("team size")) { - team_size = std::stoi(controls.getParameter("team size")); - } - if (controls.isParameter("vector length")) { - vector_length = std::stoi(controls.getParameter("vector length")); - } + if (handle->team_size != -1) team_size = handle->team_size; + if (handle->vector_length != -1) vector_length = handle->vector_length; BSR_GEMV_Transpose_Functor func(alpha, A, x, y, useConjugate); @@ -1319,13 +1277,12 @@ struct BSR_GEMM_Functor { // spMatMultiVec_no_transpose: version for CPU execution spaces // (RangePolicy or trivial serial impl used) // -template ()>::type * = nullptr> void spMatMultiVec_no_transpose( - const typename AD::execution_space &exec, - const KokkosKernels::Experimental::Controls &controls, + const typename AD::execution_space &exec, Handle *handle, const AlphaType &alpha, const KokkosSparse::Experimental::BsrMatrix< AT, AO, AD, Kokkos::MemoryTraits, AS> &A, @@ -1344,15 +1301,8 @@ void spMatMultiVec_no_transpose( AT, AO, AD, Kokkos::MemoryTraits, AS> AMatrix_Internal; - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; BSR_GEMM_Functor func(alpha, A, x, beta, y, useConjugate); @@ -1379,13 +1329,12 @@ void spMatMultiVec_no_transpose( // spMatMultiVec_no_transpose: version for GPU execution spaces (TeamPolicy // used) // -template ()>::type * = nullptr> void spMatMultiVec_no_transpose( - const typename AD::execution_space &exec, - const KokkosKernels::Experimental::Controls &controls, + const typename AD::execution_space &exec, Handle *handle, const AlphaType &alpha, 
const KokkosSparse::Experimental::BsrMatrix< AT, AO, AD, Kokkos::MemoryTraits, AS> &A, @@ -1399,15 +1348,10 @@ void spMatMultiVec_no_transpose( AMatrix_Internal; typedef typename AMatrix_Internal::execution_space execution_space; - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } + bool use_dynamic_schedule = + handle->force_dynamic_schedule; // Forces the use of a dynamic schedule + bool use_static_schedule = + handle->force_static_schedule; // Forces the use of a static schedule int team_size = -1; int vector_length = -1; @@ -1429,14 +1373,10 @@ void spMatMultiVec_no_transpose( } // - // Use the controls to allow the user to pass in some tuning parameters. + // Use the handle to allow the user to pass in some tuning parameters. 
// - if (controls.isParameter("team size")) { - team_size = std::stoi(controls.getParameter("team size")); - } - if (controls.isParameter("vector length")) { - vector_length = std::stoi(controls.getParameter("vector length")); - } + if (handle->team_size != -1) team_size = handle->team_size; + if (handle->vector_length != -1) vector_length = handle->vector_length; BSR_GEMM_Functor func(alpha, A, x, beta, y, useConjugate); @@ -1649,14 +1589,13 @@ struct BSR_GEMM_Transpose_Functor { /// \brief spMatMultiVec_transpose: version for CPU execution spaces /// (RangePolicy or trivial serial impl used) -template ()>::type * = nullptr> void spMatMultiVec_transpose( - const execution_space &exec, - const KokkosKernels::Experimental::Controls &controls, - const AlphaType &alpha, + const execution_space &exec, Handle *handle, const AlphaType &alpha, const KokkosSparse::Experimental::BsrMatrix< AT, AO, AD, Kokkos::MemoryTraits, AS> &A, const XVector &x, const BetaType &beta, YVector &y, bool useConjugate) { @@ -1674,15 +1613,8 @@ void spMatMultiVec_transpose( AT, AO, AD, Kokkos::MemoryTraits, AS> AMatrix_Internal; - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; BSR_GEMM_Transpose_Functor @@ -1705,15 +1637,14 @@ void spMatMultiVec_transpose( // // spMatMultiVec_transpose: version for GPU execution spaces (TeamPolicy used) // -template ()>::type * = nullptr> -void spMatMultiVec_transpose( - const execution_space &exec, - const KokkosKernels::Experimental::Controls &controls, - const AlphaType &alpha, const AMatrix &A, const XVector 
&x, - const BetaType &beta, YVector &y, bool useConjugate) { +void spMatMultiVec_transpose(const execution_space &exec, Handle *handle, + const AlphaType &alpha, const AMatrix &A, + const XVector &x, const BetaType &beta, YVector &y, + bool useConjugate) { if (A.numRows() <= 0) { return; } @@ -1723,18 +1654,11 @@ void spMatMultiVec_transpose( else if (beta != Kokkos::ArithTraits::one()) KokkosBlas::scal(exec, y, beta, y); - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } - int team_size = -1; - int vector_length = -1; - int64_t worksets = A.numRows(); + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; + int team_size = -1; + int vector_length = -1; + int64_t worksets = A.numRows(); const auto block_dim = A.blockDim(); if (block_dim <= 4) { @@ -1752,15 +1676,10 @@ void spMatMultiVec_transpose( } // - // Use the controls to allow the user to pass in some tuning - // parameters. + // Use the handle to allow the user to pass in some tuning parameters. 
// - if (controls.isParameter("team size")) { - team_size = std::stoi(controls.getParameter("team size")); - } - if (controls.isParameter("vector length")) { - vector_length = std::stoi(controls.getParameter("vector length")); - } + if (handle->team_size != -1) team_size = handle->team_size; + if (handle->vector_length != -1) vector_length = handle->vector_length; BSR_GEMM_Transpose_Functor func( alpha, A, x, y, useConjugate); @@ -1813,9 +1732,7 @@ void spMatMultiVec_transpose( /* ******************* */ } // namespace Bsr - } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse #endif // KOKKOSSPARSE_IMPL_SPMV_BSRMATRIX_IMPL_HPP_ diff --git a/sparse/impl/KokkosSparse_spmv_bsrmatrix_impl_v42.hpp b/sparse/impl/KokkosSparse_spmv_bsrmatrix_impl_v42.hpp index 1c0d2fc361..a0f4ed1540 100644 --- a/sparse/impl/KokkosSparse_spmv_bsrmatrix_impl_v42.hpp +++ b/sparse/impl/KokkosSparse_spmv_bsrmatrix_impl_v42.hpp @@ -121,11 +121,6 @@ void apply_v42(const typename AMatrix::execution_space &exec, Kokkos::RangePolicy policy(exec, 0, y.size()); if constexpr (YVector::rank == 1) { -// lbv - 07/26/2023: -// with_unmanaged_t<...> required Kokkos 4.1.0, -// the content of this header will be guarded -// until v4.3.0 -#if KOKKOS_VERSION >= 40100 || defined(DOXY) // Implementation expects a 2D view, so create an unmanaged 2D view // with extent 1 in the second dimension using Y2D = KokkosKernels::Impl::with_unmanaged_t>; -#else - // Implementation expects a 2D view, so create an unmanaged 2D view - // with extent 1 in the second dimension - using Y2D = Kokkos::View< - typename YVector::value_type * [1], typename YVector::array_layout, - typename YVector::device_type, Kokkos::MemoryTraits>; - using X2D = Kokkos::View< - typename XVector::value_type * [1], typename XVector::array_layout, - typename XVector::device_type, Kokkos::MemoryTraits>; -#endif // KOKKOS_VERSION >= 40100 || defined(DOXY) const Y2D yu(y.data(), y.extent(0), 1); const X2D xu(x.data(), x.extent(0), 
1); BsrSpmvV42NonTrans op(alpha, a, xu, beta, yu); diff --git a/sparse/impl/KokkosSparse_spmv_bsrmatrix_spec.hpp b/sparse/impl/KokkosSparse_spmv_bsrmatrix_spec.hpp index 564100879e..5c2bf0edfa 100644 --- a/sparse/impl/KokkosSparse_spmv_bsrmatrix_spec.hpp +++ b/sparse/impl/KokkosSparse_spmv_bsrmatrix_spec.hpp @@ -21,7 +21,7 @@ #include #include "KokkosSparse_BsrMatrix.hpp" -#include "KokkosKernels_Controls.hpp" +#include "KokkosSparse_spmv_handle.hpp" #include "KokkosKernels_Error.hpp" #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY #include @@ -29,32 +29,32 @@ #endif namespace KokkosSparse { -namespace Experimental { namespace Impl { // default is no eti available -template +template struct spmv_bsrmatrix_eti_spec_avail { enum : bool { value = false }; }; -template > struct spmv_mv_bsrmatrix_eti_spec_avail { enum : bool { value = false }; }; -} // namespace Impl -} // namespace Experimental -} // namespace KokkosSparse - #define KOKKOSSPARSE_SPMV_BSRMATRIX_ETI_SPEC_AVAIL( \ SCALAR_TYPE, ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, EXEC_SPACE_TYPE, \ MEM_SPACE_TYPE) \ template <> \ struct spmv_bsrmatrix_eti_spec_avail< \ EXEC_SPACE_TYPE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR_TYPE, const ORDINAL_TYPE, \ Kokkos::Device, \ @@ -75,6 +75,9 @@ struct spmv_mv_bsrmatrix_eti_spec_avail { template <> \ struct spmv_mv_bsrmatrix_eti_spec_avail< \ EXEC_SPACE_TYPE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR_TYPE, const ORDINAL_TYPE, \ Kokkos::Device, \ @@ -89,86 +92,83 @@ struct spmv_mv_bsrmatrix_eti_spec_avail { enum : bool { value = true }; \ }; +} // namespace Impl +} // namespace KokkosSparse + // Include which ETIs are available #include #include #include namespace KokkosSparse { -namespace Experimental { namespace Impl { // declaration -template ::value, + ExecutionSpace, Handle, AMatrix, XVector, YVector>::value, bool eti_spec_avail = 
spmv_bsrmatrix_eti_spec_avail< - ExecutionSpace, AMatrix, XVector, YVector>::value> + ExecutionSpace, Handle, AMatrix, XVector, YVector>::value> struct SPMV_BSRMATRIX { typedef typename YVector::non_const_value_type YScalar; - static void spmv_bsrmatrix( - const ExecutionSpace &space, - const KokkosKernels::Experimental::Controls &controls, const char mode[], - const YScalar &alpha, const AMatrix &A, const XVector &x, - const YScalar &beta, const YVector &y); + static void spmv_bsrmatrix(const ExecutionSpace &space, Handle *handle, + const char mode[], const YScalar &alpha, + const AMatrix &A, const XVector &x, + const YScalar &beta, const YVector &y); }; // declaration -template , bool tpl_spec_avail = spmv_mv_bsrmatrix_tpl_spec_avail< - ExecutionSpace, AMatrix, XVector, YVector>::value, + ExecutionSpace, Handle, AMatrix, XVector, YVector>::value, bool eti_spec_avail = spmv_mv_bsrmatrix_eti_spec_avail< - ExecutionSpace, AMatrix, XVector, YVector>::value> + ExecutionSpace, Handle, AMatrix, XVector, YVector>::value> struct SPMV_MV_BSRMATRIX { typedef typename YVector::non_const_value_type YScalar; - static void spmv_mv_bsrmatrix( - const ExecutionSpace &space, - const KokkosKernels::Experimental::Controls &controls, const char mode[], - const YScalar &alpha, const AMatrix &A, const XVector &x, - const YScalar &beta, const YVector &y); + static void spmv_mv_bsrmatrix(const ExecutionSpace &space, Handle *handle, + const char mode[], const YScalar &alpha, + const AMatrix &A, const XVector &x, + const YScalar &beta, const YVector &y); }; // actual implementations to be compiled #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY -// these should all be different -constexpr inline const char *ALG_V41 = "v4.1"; -constexpr inline const char *ALG_V42 = "v4.2"; -constexpr inline const char *ALG_TC = "experimental_bsr_tc"; - -template -struct SPMV_BSRMATRIX +struct SPMV_BSRMATRIX { typedef typename YVector::non_const_value_type YScalar; - static void 
spmv_bsrmatrix( - const ExecutionSpace &space, - const KokkosKernels::Experimental::Controls &controls, const char mode[], - const YScalar &alpha, const AMatrix &A, const XVector &X, - const YScalar &beta, const YVector &Y) { + static void spmv_bsrmatrix(const ExecutionSpace &space, Handle *handle, + const char mode[], const YScalar &alpha, + const AMatrix &A, const XVector &X, + const YScalar &beta, const YVector &Y) { const bool modeIsNoTrans = (mode[0] == NoTranspose[0]); const bool modeIsConjugate = (mode[0] == Conjugate[0]); const bool modeIsConjugateTrans = (mode[0] == ConjugateTranspose[0]); const bool modeIsTrans = (mode[0] == Transpose[0]); // use V41 if requested - if (controls.getParameter("algorithm") == ALG_V41) { + if (handle->algo == SPMV_BSR_V41) { if (modeIsNoTrans || modeIsConjugate) { - return Bsr::spMatVec_no_transpose(space, controls, alpha, A, X, beta, Y, + return Bsr::spMatVec_no_transpose(space, handle, alpha, A, X, beta, Y, modeIsConjugate); } else if (modeIsTrans || modeIsConjugateTrans) { - return Bsr::spMatVec_transpose(space, controls, alpha, A, X, beta, Y, + return Bsr::spMatVec_transpose(space, handle, alpha, A, X, beta, Y, modeIsConjugateTrans); } } // use V42 if possible if (KokkosKernels::Impl::kk_is_gpu_exec_space() || - controls.getParameter("algorithm") == ALG_V42) { + handle->algo == SPMV_BSR_V42) { if (modeIsNoTrans) { ::KokkosSparse::Impl::apply_v42(space, alpha, A, X, beta, Y); return; @@ -177,10 +177,10 @@ struct SPMV_BSRMATRIX -struct SPMV_MV_BSRMATRIX { +template +struct SPMV_MV_BSRMATRIX { typedef typename YVector::non_const_value_type YScalar; enum class Method { @@ -204,27 +205,18 @@ struct SPMV_MV_BSRMATRIX::value) { + if (handle->algo == SPMV_BSR_TC) method = Method::TensorCores; + if (!KokkosSparse::Impl::TensorCoresAvailable::value) { method = Method::Fallback; } // can't use tensor cores unless mode is no-transpose @@ -249,28 +241,23 @@ struct SPMV_MV_BSRMATRIXbsr_tc_precision; switch (precision) { - case 
Precision::Mixed: { + case KokkosSparse::Experimental::Bsr_TC_Precision::Mixed: { BsrMatrixSpMVTensorCoreDispatcher::dispatch(space, alpha, A, X, beta, Y); return; } - case Precision::Double: { + case KokkosSparse::Experimental::Bsr_TC_Precision::Double: { BsrMatrixSpMVTensorCoreDispatcher::dispatch(space, alpha, A, X, beta, Y); return; } - case Precision::Automatic: // fallthrough + case KokkosSparse::Experimental::Bsr_TC_Precision::Automatic: default: { constexpr bool operandsHalfHalfFloat = std::is_same::value && @@ -312,19 +299,19 @@ struct SPMV_MV_BSRMATRIXalgo == SPMV_BSR_V41) { if (modeIsNoTrans || modeIsConjugate) { - return Bsr::spMatMultiVec_no_transpose(space, controls, alpha, A, X, - beta, Y, modeIsConjugate); + return Bsr::spMatMultiVec_no_transpose(space, handle, alpha, A, X, beta, + Y, modeIsConjugate); } else if (modeIsTrans || modeIsConjugateTrans) { - return Bsr::spMatMultiVec_transpose(space, controls, alpha, A, X, beta, - Y, modeIsConjugateTrans); + return Bsr::spMatMultiVec_transpose(space, handle, alpha, A, X, beta, Y, + modeIsConjugateTrans); } } // use V42 if possible if (KokkosKernels::Impl::kk_is_gpu_exec_space() || - controls.getParameter("algorithm") == ALG_V42) { + handle->algo == SPMV_BSR_V42) { if (modeIsNoTrans) { ::KokkosSparse::Impl::apply_v42(space, alpha, A, X, beta, Y); return; @@ -333,10 +320,10 @@ struct SPMV_MV_BSRMATRIX -struct SPMV_MV_BSRMATRIX { +template +struct SPMV_MV_BSRMATRIX { typedef typename YVector::non_const_value_type YScalar; - static void spmv_mv_bsrmatrix( - const ExecutionSpace &space, - const KokkosKernels::Experimental::Controls &controls, const char mode[], - const YScalar &alpha, const AMatrix &A, const XVector &X, - const YScalar &beta, const YVector &Y) { + static void spmv_mv_bsrmatrix(const ExecutionSpace &space, Handle *handle, + const char mode[], const YScalar &alpha, + const AMatrix &A, const XVector &X, + const YScalar &beta, const YVector &Y) { static_assert(std::is_integral_v, "This 
implementation is only for integer Scalar types."); - for (typename AMatrix::non_const_size_type j = 0; j < X.extent(1); ++j) { + for (size_t j = 0; j < X.extent(1); ++j) { const auto x_j = Kokkos::subview(X, Kokkos::ALL(), j); auto y_j = Kokkos::subview(Y, Kokkos::ALL(), j); - typedef SPMV_BSRMATRIX impl_type; - impl_type::spmv_bsrmatrix(space, controls, mode, alpha, A, x_j, beta, - y_j); + impl_type::spmv_bsrmatrix(space, handle, mode, alpha, A, x_j, beta, y_j); } } }; #endif // !defined(KOKKOSKERNELS_ETI_ONLY) || // KOKKOSKERNELS_IMPL_COMPILE_LIBRARY } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse // declare / instantiate the vector version @@ -387,6 +372,9 @@ struct SPMV_MV_BSRMATRIX, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR_TYPE, const ORDINAL_TYPE, \ Kokkos::Device, \ @@ -405,6 +393,9 @@ struct SPMV_MV_BSRMATRIX, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR_TYPE, const ORDINAL_TYPE, \ Kokkos::Device, \ @@ -426,6 +417,9 @@ struct SPMV_MV_BSRMATRIX, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR_TYPE, const ORDINAL_TYPE, \ Kokkos::Device, \ @@ -444,6 +438,9 @@ struct SPMV_MV_BSRMATRIX, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR_TYPE, const ORDINAL_TYPE, \ Kokkos::Device, \ diff --git a/sparse/impl/KokkosSparse_spmv_impl.hpp b/sparse/impl/KokkosSparse_spmv_impl.hpp index 4f90002a61..5f9cbea040 100644 --- a/sparse/impl/KokkosSparse_spmv_impl.hpp +++ b/sparse/impl/KokkosSparse_spmv_impl.hpp @@ -24,6 +24,7 @@ #include "KokkosBlas1_scal.hpp" #include "KokkosKernels_ExecSpaceUtils.hpp" #include "KokkosSparse_CrsMatrix.hpp" +#include "KokkosSparse_spmv_handle.hpp" #include "KokkosSparse_spmv_impl_omp.hpp" #include "KokkosSparse_spmv_impl_merge.hpp" #include "KokkosKernels_Error.hpp" @@ -249,16 +250,15 @@ int64_t spmv_launch_parameters(int64_t numRows, int64_t nnz, // spmv_beta_no_transpose: version for CPU execution spaces (RangePolicy or // trivial serial impl used) -template 
()>::type* = nullptr> -static void spmv_beta_no_transpose( - const execution_space& exec, - const KokkosKernels::Experimental::Controls& controls, - typename YVector::const_value_type& alpha, const AMatrix& A, - const XVector& x, typename YVector::const_value_type& beta, - const YVector& y) { +static void spmv_beta_no_transpose(const execution_space& exec, Handle* handle, + typename YVector::const_value_type& alpha, + const AMatrix& A, const XVector& x, + typename YVector::const_value_type& beta, + const YVector& y) { typedef typename AMatrix::non_const_ordinal_type ordinal_type; if (A.numRows() <= static_cast(0)) { @@ -363,15 +363,8 @@ static void spmv_beta_no_transpose( } #endif - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; SPMV_Functor func(alpha, A, x, beta, y, 1); if (((A.nnz() > 10000000) || use_dynamic_schedule) && !use_static_schedule) @@ -389,47 +382,26 @@ static void spmv_beta_no_transpose( } // spmv_beta_no_transpose: version for GPU execution spaces (TeamPolicy used) -template ()>::type* = nullptr> -static void spmv_beta_no_transpose( - const execution_space& exec, - const KokkosKernels::Experimental::Controls& controls, - typename YVector::const_value_type& alpha, const AMatrix& A, - const XVector& x, typename YVector::const_value_type& beta, - const YVector& y) { +static void spmv_beta_no_transpose(const execution_space& exec, Handle* handle, + typename YVector::const_value_type& alpha, + const AMatrix& A, const XVector& x, + typename YVector::const_value_type& beta, + const YVector& y) { typedef 
typename AMatrix::non_const_ordinal_type ordinal_type; if (A.numRows() <= static_cast(0)) { return; } - bool use_dynamic_schedule = false; // Forces the use of a dynamic schedule - bool use_static_schedule = false; // Forces the use of a static schedule - if (controls.isParameter("schedule")) { - if (controls.getParameter("schedule") == "dynamic") { - use_dynamic_schedule = true; - } else if (controls.getParameter("schedule") == "static") { - use_static_schedule = true; - } - } - int team_size = -1; - int vector_length = -1; - int64_t rows_per_thread = -1; - - // Note on 03/24/20, lbv: We can use the controls - // here to allow the user to pass in some tunning - // parameters. - if (controls.isParameter("team size")) { - team_size = std::stoi(controls.getParameter("team size")); - } - if (controls.isParameter("vector length")) { - vector_length = std::stoi(controls.getParameter("vector length")); - } - if (controls.isParameter("rows per thread")) { - rows_per_thread = std::stoll(controls.getParameter("rows per thread")); - } + bool use_dynamic_schedule = handle->force_dynamic_schedule; + bool use_static_schedule = handle->force_static_schedule; + int team_size = handle->team_size; + int vector_length = handle->vector_length; + int64_t rows_per_thread = handle->rows_per_thread; int64_t rows_per_team = spmv_launch_parameters( A.numRows(), A.nnz(), rows_per_thread, team_size, vector_length); @@ -622,30 +594,29 @@ static void spmv_beta_transpose(const execution_space& exec, op); } -template -static void spmv_beta(const execution_space& exec, - const KokkosKernels::Experimental::Controls& controls, +template +static void spmv_beta(const execution_space& exec, Handle* handle, const char mode[], typename YVector::const_value_type& alpha, const AMatrix& A, const XVector& x, typename YVector::const_value_type& beta, const YVector& y) { if (mode[0] == NoTranspose[0]) { - if (controls.getParameter("algorithm") == KOKKOSSPARSE_ALG_NATIVE_MERGE) { + if (handle->algo == 
SPMV_MERGE_PATH) { SpmvMergeHierarchical::spmv( exec, mode, alpha, A, x, beta, y); } else { - spmv_beta_no_transpose(exec, controls, alpha, A, x, beta, y); + spmv_beta_no_transpose(exec, handle, alpha, A, x, beta, y); } } else if (mode[0] == Conjugate[0]) { - if (controls.getParameter("algorithm") == KOKKOSSPARSE_ALG_NATIVE_MERGE) { + if (handle->algo == SPMV_MERGE_PATH) { SpmvMergeHierarchical::spmv( exec, mode, alpha, A, x, beta, y); } else { - spmv_beta_no_transpose(exec, controls, alpha, A, x, beta, y); + spmv_beta_no_transpose(exec, handle, alpha, A, x, beta, y); } } else if (mode[0] == Transpose[0]) { spmv_beta_transpose #include "KokkosSparse_CrsMatrix.hpp" -#include "KokkosKernels_Controls.hpp" +#include "KokkosSparse_spmv_handle.hpp" // Include the actual functors #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY #include @@ -30,11 +30,13 @@ namespace KokkosSparse { namespace Impl { // Specialization struct which defines whether a specialization exists -template +template struct spmv_eti_spec_avail { enum : bool { value = false }; }; -template > struct spmv_mv_eti_spec_avail { @@ -50,6 +52,9 @@ struct spmv_mv_eti_spec_avail { template <> \ struct spmv_eti_spec_avail< \ EXEC_SPACE_TYPE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ KokkosSparse::CrsMatrix, \ Kokkos::MemoryTraits, \ @@ -70,6 +75,9 @@ struct spmv_mv_eti_spec_avail { template <> \ struct spmv_mv_eti_spec_avail< \ EXEC_SPACE_TYPE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ KokkosSparse::CrsMatrix, \ Kokkos::MemoryTraits, \ @@ -100,17 +108,16 @@ namespace Impl { /// /// For the implementation of KokkosSparse::spmv for multivectors (2-D /// Views), see the SPMV_MV struct below. 
-template < - class ExecutionSpace, class AMatrix, class XVector, class YVector, - bool tpl_spec_avail = - spmv_tpl_spec_avail::value, - bool eti_spec_avail = - spmv_eti_spec_avail::value> +template ::value, + bool eti_spec_avail = spmv_eti_spec_avail< + ExecutionSpace, Handle, AMatrix, XVector, YVector>::value> struct SPMV { typedef typename YVector::non_const_value_type coefficient_type; - static void spmv(const ExecutionSpace& space, - const KokkosKernels::Experimental::Controls& controls, + static void spmv(const ExecutionSpace& space, Handle* handle, const char mode[], const coefficient_type& alpha, const AMatrix& A, const XVector& x, const coefficient_type& beta, const YVector& y); @@ -140,18 +147,18 @@ struct SPMV { /// matrix's entries have integer type. Per Github Issue #700, we /// don't optimize as heavily for that case, in order to reduce build /// times and library sizes. -template , - bool tpl_spec_avail = spmv_mv_tpl_spec_avail::value, - bool eti_spec_avail = spmv_mv_eti_spec_avail::value> + bool tpl_spec_avail = spmv_mv_tpl_spec_avail< + ExecutionSpace, Handle, AMatrix, XVector, YVector>::value, + bool eti_spec_avail = spmv_mv_eti_spec_avail< + ExecutionSpace, Handle, AMatrix, XVector, YVector>::value> struct SPMV_MV { typedef typename YVector::non_const_value_type coefficient_type; - static void spmv_mv(const ExecutionSpace& space, - const KokkosKernels::Experimental::Controls& controls, + static void spmv_mv(const ExecutionSpace& space, Handle* handle, const char mode[], const coefficient_type& alpha, const AMatrix& A, const XVector& x, const coefficient_type& beta, const YVector& y); @@ -160,90 +167,114 @@ struct SPMV_MV { #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY //! Full specialization of spmv for single vectors (1-D Views). 
// Unification layer -template -struct SPMV +struct SPMV { typedef typename YVector::non_const_value_type coefficient_type; - static void spmv(const ExecutionSpace& space, - const KokkosKernels::Experimental::Controls& controls, + static void spmv(const ExecutionSpace& space, Handle* handle, const char mode[], const coefficient_type& alpha, const AMatrix& A, const XVector& x, const coefficient_type& beta, const YVector& y) { typedef Kokkos::ArithTraits KAT; - if (alpha == KAT::zero()) { - if (beta != KAT::one()) { - KokkosBlas::scal(space, y, beta, y); - } - return; - } - if (beta == KAT::zero()) { - spmv_beta( - space, controls, mode, alpha, A, x, beta, y); + spmv_beta( + space, handle, mode, alpha, A, x, beta, y); } else if (beta == KAT::one()) { - spmv_beta( - space, controls, mode, alpha, A, x, beta, y); + spmv_beta( + space, handle, mode, alpha, A, x, beta, y); } else if (beta == -KAT::one()) { - spmv_beta( - space, controls, mode, alpha, A, x, beta, y); + spmv_beta( + space, handle, mode, alpha, A, x, beta, y); } else { - spmv_beta( - space, controls, mode, alpha, A, x, beta, y); + spmv_beta( + space, handle, mode, alpha, A, x, beta, y); } } }; //! Full specialization of spmv_mv for single vectors (2-D Views). 
// Unification layer -template -struct SPMV_MV +struct SPMV_MV { typedef typename YVector::non_const_value_type coefficient_type; - static void spmv_mv(const ExecutionSpace& space, - const KokkosKernels::Experimental::Controls& /*controls*/, + static void spmv_mv(const ExecutionSpace& space, Handle* handle, const char mode[], const coefficient_type& alpha, const AMatrix& A, const XVector& x, const coefficient_type& beta, const YVector& y) { typedef Kokkos::ArithTraits KAT; - - if (alpha == KAT::zero()) { - spmv_alpha_mv( - space, mode, alpha, A, x, beta, y); - } else if (alpha == KAT::one()) { - spmv_alpha_mv( - space, mode, alpha, A, x, beta, y); - } else if (alpha == -KAT::one()) { - spmv_alpha_mv( - space, mode, alpha, A, x, beta, y); + // Intercept special case: if x/y have only 1 column and both are + // contiguous, use the more efficient single-vector impl. + // + // We cannot do this if x or y is noncontiguous, because the column subview + // must be LayoutStride which is not ETI'd. 
+ // + // Do not use a TPL even if one is available for the types: + // we don't want the same handle being used in both TPL and non-TPL versions + if (x.extent(1) == size_t(1) && x.span_is_contiguous() && + y.span_is_contiguous()) { + Kokkos::View + x0(x.data(), x.extent(0)); + Kokkos::View + y0(y.data(), y.extent(0)); + if (beta == KAT::zero()) { + spmv_beta(space, handle, mode, alpha, A, x0, beta, y0); + } else if (beta == KAT::one()) { + spmv_beta(space, handle, mode, alpha, A, x0, beta, y0); + } else if (beta == -KAT::one()) { + spmv_beta(space, handle, mode, alpha, A, x0, beta, y0); + } else { + spmv_beta(space, handle, mode, alpha, A, x0, beta, y0); + } } else { - spmv_alpha_mv( - space, mode, alpha, A, x, beta, y); + if (alpha == KAT::zero()) { + spmv_alpha_mv( + space, mode, alpha, A, x, beta, y); + } else if (alpha == KAT::one()) { + spmv_alpha_mv( + space, mode, alpha, A, x, beta, y); + } else if (alpha == -KAT::one()) { + spmv_alpha_mv( + space, mode, alpha, A, x, beta, y); + } else { + spmv_alpha_mv( + space, mode, alpha, A, x, beta, y); + } } } }; -template -struct SPMV_MV +struct SPMV_MV { typedef typename YVector::non_const_value_type coefficient_type; - static void spmv_mv(const ExecutionSpace& space, - const KokkosKernels::Experimental::Controls& /*controls*/, + static void spmv_mv(const ExecutionSpace& space, Handle* handle, const char mode[], const coefficient_type& alpha, const AMatrix& A, const XVector& x, const coefficient_type& beta, const YVector& y) { static_assert(std::is_integral_v, "This implementation is only for integer Scalar types."); KokkosKernels::Experimental::Controls defaultControls; - for (typename AMatrix::non_const_size_type j = 0; j < x.extent(1); ++j) { + for (size_t j = 0; j < x.extent(1); ++j) { auto x_j = Kokkos::subview(x, Kokkos::ALL(), j); auto y_j = Kokkos::subview(y, Kokkos::ALL(), j); - typedef SPMV + typedef SPMV impl_type; - impl_type::spmv(space, defaultControls, mode, alpha, A, x_j, beta, y_j); + 
impl_type::spmv(space, handle, mode, alpha, A, x_j, beta, y_j); } } }; @@ -264,6 +295,9 @@ struct SPMV_MV, \ KokkosSparse::CrsMatrix, \ Kokkos::MemoryTraits, \ @@ -282,6 +316,9 @@ struct SPMV_MV, \ KokkosSparse::CrsMatrix, \ Kokkos::MemoryTraits, \ @@ -300,6 +337,9 @@ struct SPMV_MV, \ KokkosSparse::CrsMatrix, \ Kokkos::MemoryTraits, \ @@ -318,6 +358,9 @@ struct SPMV_MV, \ KokkosSparse::CrsMatrix, \ Kokkos::MemoryTraits, \ diff --git a/sparse/impl/KokkosSparse_sptrsv_cuSPARSE_impl.hpp b/sparse/impl/KokkosSparse_sptrsv_cuSPARSE_impl.hpp index 7605f03fa2..019a63fcd7 100644 --- a/sparse/impl/KokkosSparse_sptrsv_cuSPARSE_impl.hpp +++ b/sparse/impl/KokkosSparse_sptrsv_cuSPARSE_impl.hpp @@ -22,10 +22,11 @@ namespace KokkosSparse { namespace Impl { -template -void sptrsvcuSPARSE_symbolic(KernelHandle *sptrsv_handle, +void sptrsvcuSPARSE_symbolic(ExecutionSpace &space, KernelHandle *sptrsv_handle, typename KernelHandle::nnz_lno_t nrows, ain_row_index_view_type row_map, ain_nonzero_index_view_type entries, @@ -61,6 +62,9 @@ void sptrsvcuSPARSE_symbolic(KernelHandle *sptrsv_handle, typename KernelHandle::SPTRSVcuSparseHandleType *h = sptrsv_handle->get_cuSparseHandle(); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetStream(h->handle, space.cuda_stream())); + int64_t nnz = static_cast(entries.extent(0)); size_t pBufferSize; void *rm; @@ -98,13 +102,13 @@ void sptrsvcuSPARSE_symbolic(KernelHandle *sptrsv_handle, CUSPARSE_INDEX_BASE_ZERO, cudaValueType)); // Create dummy dense vector B (RHS) - nnz_scalar_view_t b_dummy("b_dummy", nrows); + nnz_scalar_view_t b_dummy(Kokkos::view_alloc(space, "b_dummy"), nrows); KOKKOS_CUSPARSE_SAFE_CALL( cusparseCreateDnVec(&(h->vecBDescr_dummy), static_cast(nrows), b_dummy.data(), cudaValueType)); // Create dummy dense vector X (LHS) - nnz_scalar_view_t x_dummy("x_dummy", nrows); + nnz_scalar_view_t x_dummy(Kokkos::view_alloc(space, "x_dummy"), nrows); KOKKOS_CUSPARSE_SAFE_CALL( cusparseCreateDnVec(&(h->vecXDescr_dummy), static_cast(nrows), 
x_dummy.data(), cudaValueType)); @@ -155,17 +159,20 @@ void sptrsvcuSPARSE_symbolic(KernelHandle *sptrsv_handle, std::is_same::value || std::is_same::value; - if (!is_cuda_space) { + if constexpr (!is_cuda_space) { throw std::runtime_error( "KokkosKernels sptrsvcuSPARSE_symbolic: MEMORY IS NOT ALLOCATED IN GPU " "DEVICE for CUSPARSE\n"); - } else if (std::is_same::value) { + } else if constexpr (std::is_same::value) { bool is_lower = sptrsv_handle->is_lower_tri(); sptrsv_handle->create_cuSPARSE_Handle(trans, is_lower); - typename KernelHandle::SPTRSVcuSparseHandleType* h = + typename KernelHandle::SPTRSVcuSparseHandleType *h = sptrsv_handle->get_cuSparseHandle(); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetStream(h->handle, space.cuda_stream())); + cusparseStatus_t status; status = cusparseCreateCsrsv2Info(&(h->info)); if (CUSPARSE_STATUS_SUCCESS != status) @@ -178,85 +185,86 @@ void sptrsvcuSPARSE_symbolic(KernelHandle *sptrsv_handle, if (!std::is_same::value) sptrsv_handle->allocate_tmp_int_rowmap(row_map.extent(0)); - const int* rm = !std::is_same::value + const int *rm = !std::is_same::value ? sptrsv_handle->get_int_rowmap_ptr_copy(row_map) - : (const int*)row_map.data(); - const int* ent = (const int*)entries.data(); - const scalar_type* vals = values.data(); + : (const int *)row_map.data(); + const int *ent = (const int *)entries.data(); + const scalar_type *vals = values.data(); if (std::is_same::value) { cusparseDcsrsv2_bufferSize(h->handle, h->transpose, nrows, nnz, h->descr, - (double*)vals, (int*)rm, (int*)ent, h->info, + (double *)vals, (int *)rm, (int *)ent, h->info, &pBufferSize); // pBuffer returned by cudaMalloc is automatically aligned to 128 bytes. 
cudaError_t my_error; - my_error = cudaMalloc((void**)&(h->pBuffer), pBufferSize); + my_error = cudaMalloc((void **)&(h->pBuffer), pBufferSize); if (cudaSuccess != my_error) std::cout << "cudmalloc pBuffer error_t error name " << cudaGetErrorString(my_error) << std::endl; status = cusparseDcsrsv2_analysis( - h->handle, h->transpose, nrows, nnz, h->descr, (double*)vals, - (int*)rm, (int*)ent, h->info, h->policy, h->pBuffer); + h->handle, h->transpose, nrows, nnz, h->descr, (double *)vals, + (int *)rm, (int *)ent, h->info, h->policy, h->pBuffer); if (CUSPARSE_STATUS_SUCCESS != status) std::cout << "analysis status error name " << (status) << std::endl; } else if (std::is_same::value) { cusparseScsrsv2_bufferSize(h->handle, h->transpose, nrows, nnz, h->descr, - (float*)vals, (int*)rm, (int*)ent, h->info, + (float *)vals, (int *)rm, (int *)ent, h->info, &pBufferSize); // pBuffer returned by cudaMalloc is automatically aligned to 128 bytes. cudaError_t my_error; - my_error = cudaMalloc((void**)&(h->pBuffer), pBufferSize); + my_error = cudaMalloc((void **)&(h->pBuffer), pBufferSize); if (cudaSuccess != my_error) std::cout << "cudmalloc pBuffer error_t error name " << cudaGetErrorString(my_error) << std::endl; status = cusparseScsrsv2_analysis( - h->handle, h->transpose, nrows, nnz, h->descr, (float*)vals, (int*)rm, - (int*)ent, h->info, h->policy, h->pBuffer); + h->handle, h->transpose, nrows, nnz, h->descr, (float *)vals, + (int *)rm, (int *)ent, h->info, h->policy, h->pBuffer); if (CUSPARSE_STATUS_SUCCESS != status) std::cout << "analysis status error name " << (status) << std::endl; } else if (std::is_same >::value) { cusparseZcsrsv2_bufferSize(h->handle, h->transpose, nrows, nnz, h->descr, - (cuDoubleComplex*)vals, (int*)rm, (int*)ent, + (cuDoubleComplex *)vals, (int *)rm, (int *)ent, h->info, &pBufferSize); // pBuffer returned by cudaMalloc is automatically aligned to 128 bytes. 
cudaError_t my_error; - my_error = cudaMalloc((void**)&(h->pBuffer), pBufferSize); + my_error = cudaMalloc((void **)&(h->pBuffer), pBufferSize); if (cudaSuccess != my_error) std::cout << "cudmalloc pBuffer error_t error name " << cudaGetErrorString(my_error) << std::endl; - status = cusparseZcsrsv2_analysis( - h->handle, h->transpose, nrows, nnz, h->descr, (cuDoubleComplex*)vals, - (int*)rm, (int*)ent, h->info, h->policy, h->pBuffer); + status = + cusparseZcsrsv2_analysis(h->handle, h->transpose, nrows, nnz, + h->descr, (cuDoubleComplex *)vals, (int *)rm, + (int *)ent, h->info, h->policy, h->pBuffer); if (CUSPARSE_STATUS_SUCCESS != status) std::cout << "analysis status error name " << (status) << std::endl; } else if (std::is_same >::value) { cusparseCcsrsv2_bufferSize(h->handle, h->transpose, nrows, nnz, h->descr, - (cuComplex*)vals, (int*)rm, (int*)ent, h->info, - &pBufferSize); + (cuComplex *)vals, (int *)rm, (int *)ent, + h->info, &pBufferSize); // pBuffer returned by cudaMalloc is automatically aligned to 128 bytes. 
cudaError_t my_error; - my_error = cudaMalloc((void**)&(h->pBuffer), pBufferSize); + my_error = cudaMalloc((void **)&(h->pBuffer), pBufferSize); if (cudaSuccess != my_error) std::cout << "cudmalloc pBuffer error_t error name " << cudaGetErrorString(my_error) << std::endl; status = cusparseCcsrsv2_analysis( - h->handle, h->transpose, nrows, nnz, h->descr, (cuComplex*)vals, - (int*)rm, (int*)ent, h->info, h->policy, h->pBuffer); + h->handle, h->transpose, nrows, nnz, h->descr, (cuComplex *)vals, + (int *)rm, (int *)ent, h->info, h->policy, h->pBuffer); if (CUSPARSE_STATUS_SUCCESS != status) std::cout << "analysis status error name " << (status) << std::endl; @@ -269,6 +277,7 @@ void sptrsvcuSPARSE_symbolic(KernelHandle *sptrsv_handle, } #endif #else + (void)space; (void)sptrsv_handle; (void)nrows; (void)row_map; @@ -281,10 +290,11 @@ void sptrsvcuSPARSE_symbolic(KernelHandle *sptrsv_handle, } template < - typename KernelHandle, typename ain_row_index_view_type, - typename ain_nonzero_index_view_type, typename ain_values_scalar_view_type, - typename b_values_scalar_view_type, typename x_values_scalar_view_type> -void sptrsvcuSPARSE_solve(KernelHandle *sptrsv_handle, + typename ExecutionSpace, typename KernelHandle, + typename ain_row_index_view_type, typename ain_nonzero_index_view_type, + typename ain_values_scalar_view_type, typename b_values_scalar_view_type, + typename x_values_scalar_view_type> +void sptrsvcuSPARSE_solve(ExecutionSpace &space, KernelHandle *sptrsv_handle, typename KernelHandle::nnz_lno_t nrows, ain_row_index_view_type row_map, ain_nonzero_index_view_type entries, @@ -323,6 +333,9 @@ void sptrsvcuSPARSE_solve(KernelHandle *sptrsv_handle, typename KernelHandle::SPTRSVcuSparseHandleType *h = sptrsv_handle->get_cuSparseHandle(); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetStream(h->handle, space.cuda_stream())); + const scalar_type alpha = scalar_type(1.0); cudaDataType cudaValueType = cuda_data_type_from(); @@ -354,18 +367,23 @@ void 
sptrsvcuSPARSE_solve(KernelHandle *sptrsv_handle, if (std::is_same::value) { cusparseStatus_t status; - typename KernelHandle::SPTRSVcuSparseHandleType* h = + typename KernelHandle::SPTRSVcuSparseHandleType *h = sptrsv_handle->get_cuSparseHandle(); + if constexpr (std::is_same_v) { + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetStream(h->handle, space.cuda_stream())); + } + int nnz = entries.extent_int(0); - const int* rm = !std::is_same::value + const int *rm = !std::is_same::value ? sptrsv_handle->get_int_rowmap_ptr() - : (const int*)row_map.data(); - const int* ent = (const int*)entries.data(); - const scalar_type* vals = values.data(); - const scalar_type* bv = rhs.data(); - scalar_type* xv = lhs.data(); + : (const int *)row_map.data(); + const int *ent = (const int *)entries.data(); + const scalar_type *vals = values.data(); + const scalar_type *bv = rhs.data(); + scalar_type *xv = lhs.data(); if (std::is_same::value) { if (h->pBuffer == nullptr) { @@ -373,10 +391,10 @@ void sptrsvcuSPARSE_solve(KernelHandle *sptrsv_handle, } const double alpha = double(1); - status = cusparseDcsrsv2_solve(h->handle, h->transpose, nrows, nnz, - &alpha, h->descr, (double*)vals, (int*)rm, - (int*)ent, h->info, (double*)bv, - (double*)xv, h->policy, h->pBuffer); + status = cusparseDcsrsv2_solve( + h->handle, h->transpose, nrows, nnz, &alpha, h->descr, (double *)vals, + (int *)rm, (int *)ent, h->info, (double *)bv, (double *)xv, h->policy, + h->pBuffer); if (CUSPARSE_STATUS_SUCCESS != status) std::cout << "solve status error name " << (status) << std::endl; @@ -387,9 +405,9 @@ void sptrsvcuSPARSE_solve(KernelHandle *sptrsv_handle, const float alpha = float(1); status = cusparseScsrsv2_solve(h->handle, h->transpose, nrows, nnz, - &alpha, h->descr, (float*)vals, (int*)rm, - (int*)ent, h->info, (float*)bv, (float*)xv, - h->policy, h->pBuffer); + &alpha, h->descr, (float *)vals, (int *)rm, + (int *)ent, h->info, (float *)bv, + (float *)xv, h->policy, h->pBuffer); if 
(CUSPARSE_STATUS_SUCCESS != status) std::cout << "solve status error name " << (status) << std::endl; @@ -399,8 +417,8 @@ void sptrsvcuSPARSE_solve(KernelHandle *sptrsv_handle, cualpha.y = 0.0; status = cusparseZcsrsv2_solve( h->handle, h->transpose, nrows, nnz, &cualpha, h->descr, - (cuDoubleComplex*)vals, (int*)rm, (int*)ent, h->info, - (cuDoubleComplex*)bv, (cuDoubleComplex*)xv, h->policy, h->pBuffer); + (cuDoubleComplex *)vals, (int *)rm, (int *)ent, h->info, + (cuDoubleComplex *)bv, (cuDoubleComplex *)xv, h->policy, h->pBuffer); if (CUSPARSE_STATUS_SUCCESS != status) std::cout << "solve status error name " << (status) << std::endl; @@ -410,8 +428,8 @@ void sptrsvcuSPARSE_solve(KernelHandle *sptrsv_handle, cualpha.y = 0.0; status = cusparseCcsrsv2_solve( h->handle, h->transpose, nrows, nnz, &cualpha, h->descr, - (cuComplex*)vals, (int*)rm, (int*)ent, h->info, (cuComplex*)bv, - (cuComplex*)xv, h->policy, h->pBuffer); + (cuComplex *)vals, (int *)rm, (int *)ent, h->info, (cuComplex *)bv, + (cuComplex *)xv, h->policy, h->pBuffer); if (CUSPARSE_STATUS_SUCCESS != status) std::cout << "solve status error name " << (status) << std::endl; @@ -425,6 +443,7 @@ void sptrsvcuSPARSE_solve(KernelHandle *sptrsv_handle, } #endif #else + (void)space; (void)sptrsv_handle; (void)nrows; (void)row_map; @@ -539,13 +558,13 @@ void sptrsvcuSPARSE_solve_streams( "CUSPARSE requires local ordinals to be integer.\n"); } else { const scalar_type alpha = scalar_type(1.0); - std::vector sptrsv_handle_v(nstreams); - std::vector h_v(nstreams); - std::vector rm_v(nstreams); - std::vector ent_v(nstreams); - std::vector vals_v(nstreams); - std::vector bv_v(nstreams); - std::vector xv_v(nstreams); + std::vector sptrsv_handle_v(nstreams); + std::vector h_v(nstreams); + std::vector rm_v(nstreams); + std::vector ent_v(nstreams); + std::vector vals_v(nstreams); + std::vector bv_v(nstreams); + std::vector xv_v(nstreams); for (int i = 0; i < nstreams; i++) { sptrsv_handle_v[i] = 
handle_v[i].get_sptrsv_handle(); @@ -560,8 +579,8 @@ void sptrsvcuSPARSE_solve_streams( } rm_v[i] = !std::is_same::value ? sptrsv_handle_v[i]->get_int_rowmap_ptr() - : reinterpret_cast(row_map_v[i].data()); - ent_v[i] = reinterpret_cast(entries_v[i].data()); + : reinterpret_cast(row_map_v[i].data()); + ent_v[i] = reinterpret_cast(entries_v[i].data()); vals_v[i] = values_v[i].data(); bv_v[i] = rhs_v[i].data(); xv_v[i] = lhs_v[i].data(); @@ -573,42 +592,42 @@ void sptrsvcuSPARSE_solve_streams( if (std::is_same::value) { KOKKOS_CUSPARSE_SAFE_CALL(cusparseDcsrsv2_solve( h_v[i]->handle, h_v[i]->transpose, nrows, nnz, - reinterpret_cast(&alpha), h_v[i]->descr, - reinterpret_cast(vals_v[i]), - reinterpret_cast(rm_v[i]), - reinterpret_cast(ent_v[i]), h_v[i]->info, - reinterpret_cast(bv_v[i]), - reinterpret_cast(xv_v[i]), h_v[i]->policy, + reinterpret_cast(&alpha), h_v[i]->descr, + reinterpret_cast(vals_v[i]), + reinterpret_cast(rm_v[i]), + reinterpret_cast(ent_v[i]), h_v[i]->info, + reinterpret_cast(bv_v[i]), + reinterpret_cast(xv_v[i]), h_v[i]->policy, h_v[i]->pBuffer)); } else if (std::is_same::value) { KOKKOS_CUSPARSE_SAFE_CALL(cusparseScsrsv2_solve( h_v[i]->handle, h_v[i]->transpose, nrows, nnz, - reinterpret_cast(&alpha), h_v[i]->descr, - reinterpret_cast(vals_v[i]), - reinterpret_cast(rm_v[i]), - reinterpret_cast(ent_v[i]), h_v[i]->info, - reinterpret_cast(bv_v[i]), - reinterpret_cast(xv_v[i]), h_v[i]->policy, + reinterpret_cast(&alpha), h_v[i]->descr, + reinterpret_cast(vals_v[i]), + reinterpret_cast(rm_v[i]), + reinterpret_cast(ent_v[i]), h_v[i]->info, + reinterpret_cast(bv_v[i]), + reinterpret_cast(xv_v[i]), h_v[i]->policy, h_v[i]->pBuffer)); } else if (std::is_same >::value) { KOKKOS_CUSPARSE_SAFE_CALL(cusparseZcsrsv2_solve( h_v[i]->handle, h_v[i]->transpose, nrows, nnz, - reinterpret_cast(&alpha), h_v[i]->descr, - reinterpret_cast(vals_v[i]), - reinterpret_cast(rm_v[i]), - reinterpret_cast(ent_v[i]), h_v[i]->info, - reinterpret_cast(bv_v[i]), - 
reinterpret_cast(xv_v[i]), h_v[i]->policy, + reinterpret_cast(&alpha), h_v[i]->descr, + reinterpret_cast(vals_v[i]), + reinterpret_cast(rm_v[i]), + reinterpret_cast(ent_v[i]), h_v[i]->info, + reinterpret_cast(bv_v[i]), + reinterpret_cast(xv_v[i]), h_v[i]->policy, h_v[i]->pBuffer)); } else if (std::is_same >::value) { KOKKOS_CUSPARSE_SAFE_CALL(cusparseCcsrsv2_solve( h_v[i]->handle, h_v[i]->transpose, nrows, nnz, - reinterpret_cast(&alpha), h_v[i]->descr, - reinterpret_cast(vals_v[i]), - reinterpret_cast(rm_v[i]), - reinterpret_cast(ent_v[i]), h_v[i]->info, - reinterpret_cast(bv_v[i]), - reinterpret_cast(xv_v[i]), h_v[i]->policy, + reinterpret_cast(&alpha), h_v[i]->descr, + reinterpret_cast(vals_v[i]), + reinterpret_cast(rm_v[i]), + reinterpret_cast(ent_v[i]), h_v[i]->info, + reinterpret_cast(bv_v[i]), + reinterpret_cast(xv_v[i]), h_v[i]->policy, h_v[i]->pBuffer)); } else { throw std::runtime_error("CUSPARSE wrapper error: unsupported type.\n"); diff --git a/sparse/impl/KokkosSparse_sptrsv_solve_impl.hpp b/sparse/impl/KokkosSparse_sptrsv_solve_impl.hpp index ee7e83b554..a64a4d23bc 100644 --- a/sparse/impl/KokkosSparse_sptrsv_solve_impl.hpp +++ b/sparse/impl/KokkosSparse_sptrsv_solve_impl.hpp @@ -664,8 +664,6 @@ struct LowerTriLvlSchedTP2SolverFunctor { // Helper functors for Lower-triangular solve with SpMV template struct SparseTriSupernodalSpMVFunctor { - // using execution_space = typename LHSType::execution_space; - // using memory_space = typename execution_space::memory_space; using execution_space = typename TriSolveHandle::HandleExecSpace; using memory_space = typename TriSolveHandle::HandleTempMemorySpace; @@ -2891,16 +2889,15 @@ void upper_tri_solve_cg(TriSolveHandle &thandle, const RowMapType row_map, #endif -template -void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, - const EntriesType entries, const ValuesType values, - const RHSType &rhs, LHSType &lhs) { +template +void lower_tri_solve(ExecutionSpace &space, TriSolveHandle 
&thandle, + const RowMapType row_map, const EntriesType entries, + const ValuesType values, const RHSType &rhs, + LHSType &lhs) { #if defined(KOKKOS_ENABLE_CUDA) && defined(KOKKOSPSTRSV_SOLVE_IMPL_PROFILE) cudaProfilerStop(); #endif - - typedef typename TriSolveHandle::execution_space execution_space; typedef typename TriSolveHandle::size_type size_type; typedef typename TriSolveHandle::nnz_lno_view_t NGBLType; @@ -2914,7 +2911,8 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, #if defined(KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV) using namespace KokkosSparse::Experimental; - using memory_space = typename TriSolveHandle::memory_space; + using memory_space = typename TriSolveHandle::HandleTempMemorySpace; + using device_t = Kokkos::Device; using integer_view_t = typename TriSolveHandle::integer_view_t; using integer_view_host_t = typename TriSolveHandle::integer_view_host_t; using scalar_t = typename ValuesType::non_const_value_type; @@ -2981,8 +2979,10 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, KokkosSparse::Experimental::SPTRSVAlgorithm::SEQLVLSCHD_RP) { Kokkos::parallel_for( "parfor_fixed_lvl", - Kokkos::RangePolicy(node_count, - node_count + lvl_nodes), + Kokkos::Experimental::require( + Kokkos::RangePolicy(space, node_count, + node_count + lvl_nodes), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), LowerTriLvlSchedRPSolverFunctor( @@ -2990,8 +2990,8 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, } else if (thandle.get_algorithm() == KokkosSparse::Experimental::SPTRSVAlgorithm:: SEQLVLSCHD_TP1) { - typedef Kokkos::TeamPolicy policy_type; - int team_size = thandle.get_team_size(); + using team_policy_t = Kokkos::TeamPolicy; + int team_size = thandle.get_team_size(); #ifdef KOKKOSKERNELS_SPTRSV_TRILVLSCHED TriLvlSchedTP1SolverFunctor; + using team_policy_type = Kokkos::TeamPolicy; using supernode_view_type = - Kokkos::View; if (diag_kernel_type_host(lvl) == 3) { // 
using device-level kernels (functor is called to scatter the @@ -3079,9 +3087,12 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, SparseTriSupernodalSpMVFunctor sptrsv_init_functor(-2, node_count, nodes_grouped_by_level, supercols, work_offset_data, lhs, work); - Kokkos::parallel_for("parfor_tri_supernode_spmv", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_tri_supernode_spmv", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); } for (size_type league_rank = 0; league_rank < lvl_nodes; @@ -3118,7 +3129,7 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, auto Ljj = Kokkos::subview( viewL, range_type(0, nsrow), Kokkos::ALL()); // s-th supernocal column of L - KokkosBlas::gemv("N", one, Ljj, Xj, zero, Y); + KokkosBlas::gemv(space, "N", one, Ljj, Xj, zero, Y); } else { auto Xj = Kokkos::subview( lhs, @@ -3131,15 +3142,17 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, if (invert_diagonal) { auto Y = Kokkos::subview( work, range_type(workoffset, workoffset + nscol)); - KokkosBlas::gemv("N", one, Ljj, Y, zero, Xj); + KokkosBlas::gemv(space, "N", one, Ljj, Y, zero, Xj); } else { char unit_diag = (unit_diagonal ? 
'U' : 'N'); // NOTE: we currently supports only default_layout = // LayoutLeft - Kokkos::View Xjj(Xj.data(), nscol, 1); - KokkosBlas::trsm("L", "L", "N", &unit_diag, one, Ljj, Xjj); + KokkosBlas::trsm(space, "L", "L", "N", &unit_diag, one, Ljj, + Xjj); + // TODO: space.fence(); Kokkos::fence(); } // update off-diagonal blocks @@ -3155,7 +3168,7 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, viewL, range_type(nscol, nsrow), Kokkos::ALL()); // off-diagonal blocks of s-th supernodal // column of L - KokkosBlas::gemv("N", one, Lij, Xj, zero, Z); + KokkosBlas::gemv(space, "N", one, Lij, Xj, zero, Z); } } } @@ -3165,9 +3178,12 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, SparseTriSupernodalSpMVFunctor sptrsv_init_functor(-1, node_count, nodes_grouped_by_level, supercols, work_offset_data, lhs, work); - Kokkos::parallel_for("parfor_tri_supernode_spmv", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_tri_supernode_spmv", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); } } @@ -3178,9 +3194,12 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, supercols, row_map, entries, values, lvl, kernel_type, diag_kernel_type, lhs, work, work_offset, nodes_grouped_by_level, node_count); - Kokkos::parallel_for("parfor_lsolve_supernode", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_functor); + Kokkos::parallel_for( + "parfor_lsolve_supernode", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_functor); #ifdef profile_supernodal_etree Kokkos::fence(); @@ -3200,7 +3219,7 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, #endif // initialize input & output vectors - using team_policy_type = 
Kokkos::TeamPolicy; + using team_policy_type = Kokkos::TeamPolicy; // update with spmv (one or two SpMV) bool transpose_spmv = @@ -3210,36 +3229,45 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, if (!invert_offdiagonal) { // solve with diagonals auto digmat = thandle.get_diagblock(lvl); - KokkosSparse::spmv(tran, one, digmat, lhs, one, work); + KokkosSparse::spmv(space, tran, one, digmat, lhs, one, work); // copy from work to lhs corresponding to diagonal blocks SparseTriSupernodalSpMVFunctor sptrsv_init_functor(-1, node_count, nodes_grouped_by_level, supercols, supercols, lhs, work); - Kokkos::parallel_for("parfor_lsolve_supernode", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_lsolve_supernode", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); } else { // copy lhs corresponding to diagonal blocks to work and zero out in // lhs SparseTriSupernodalSpMVFunctor sptrsv_init_functor(1, node_count, nodes_grouped_by_level, supercols, supercols, lhs, work); - Kokkos::parallel_for("parfor_lsolve_supernode", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_lsolve_supernode", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); } // update off-diagonals (potentiall combined with solve with // diagonals) auto submat = thandle.get_submatrix(lvl); - KokkosSparse::spmv(tran, one, submat, work, one, lhs); + KokkosSparse::spmv(space, tran, one, submat, work, one, lhs); // reinitialize workspace SparseTriSupernodalSpMVFunctor sptrsv_finalize_functor(0, node_count, nodes_grouped_by_level, supercols, supercols, lhs, work); - Kokkos::parallel_for("parfor_lsolve_supernode", - team_policy_type(lvl_nodes, Kokkos::AUTO), 
- sptrsv_finalize_functor); + Kokkos::parallel_for( + "parfor_lsolve_supernode", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_finalize_functor); #ifdef profile_supernodal_etree Kokkos::fence(); @@ -3272,16 +3300,17 @@ void lower_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, } // end lower_tri_solve -template -void upper_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, - const EntriesType entries, const ValuesType values, - const RHSType &rhs, LHSType &lhs) { +template +void upper_tri_solve(ExecutionSpace &space, TriSolveHandle &thandle, + const RowMapType row_map, const EntriesType entries, + const ValuesType values, const RHSType &rhs, + LHSType &lhs) { #if defined(KOKKOS_ENABLE_CUDA) && defined(KOKKOSPSTRSV_SOLVE_IMPL_PROFILE) cudaProfilerStop(); #endif - typedef typename TriSolveHandle::execution_space execution_space; - + using memory_space = typename TriSolveHandle::HandleTempMemorySpace; + using device_t = Kokkos::Device; typedef typename TriSolveHandle::size_type size_type; typedef typename TriSolveHandle::nnz_lno_view_t NGBLType; @@ -3298,7 +3327,6 @@ void upper_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, #if defined(KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV) using namespace KokkosSparse::Experimental; - using memory_space = typename TriSolveHandle::memory_space; using integer_view_t = typename TriSolveHandle::integer_view_t; using integer_view_host_t = typename TriSolveHandle::integer_view_host_t; using scalar_t = typename ValuesType::non_const_value_type; @@ -3365,14 +3393,16 @@ void upper_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, KokkosSparse::Experimental::SPTRSVAlgorithm::SEQLVLSCHD_RP) { Kokkos::parallel_for( "parfor_fixed_lvl", - Kokkos::RangePolicy(node_count, - node_count + lvl_nodes), + Kokkos::Experimental::require( + Kokkos::RangePolicy(space, node_count, + node_count + 
lvl_nodes), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), UpperTriLvlSchedRPSolverFunctor( row_map, entries, values, lhs, rhs, nodes_grouped_by_level)); } else if (thandle.get_algorithm() == KokkosSparse::Experimental::SPTRSVAlgorithm::SEQLVLSCHD_TP1) { - typedef Kokkos::TeamPolicy policy_type; + using team_policy_t = Kokkos::TeamPolicy; int team_size = thandle.get_team_size(); @@ -3388,11 +3418,19 @@ void upper_tri_solve(TriSolveHandle &thandle, const RowMapType row_map, node_count); #endif if (team_size == -1) - Kokkos::parallel_for("parfor_u_team", - policy_type(lvl_nodes, Kokkos::AUTO), tstf); + Kokkos::parallel_for( + "parfor_u_team", + Kokkos::Experimental::require( + team_policy_t(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + tstf); else - Kokkos::parallel_for("parfor_u_team", - policy_type(lvl_nodes, team_size), tstf); + Kokkos::parallel_for( + "parfor_u_team", + Kokkos::Experimental::require( + team_policy_t(space, lvl_nodes, team_size), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + tstf); } // TP2 algorithm has issues with some offset-ordinal combo to be addressed /* @@ -3444,7 +3482,7 @@ tstf); } // end elseif timer.reset(); #endif - using team_policy_type = Kokkos::TeamPolicy; + using team_policy_type = Kokkos::TeamPolicy; if (thandle.is_column_major()) { // U stored in CSC if (diag_kernel_type_host(lvl) == 3) { // using device-level kernels (functor is called to gather the input @@ -3457,9 +3495,12 @@ tstf); } // end elseif SparseTriSupernodalSpMVFunctor sptrsv_init_functor(-2, node_count, nodes_grouped_by_level, supercols, work_offset_data, lhs, work); - Kokkos::parallel_for("parfor_tri_supernode_spmv", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_tri_supernode_spmv", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + 
sptrsv_init_functor); } for (size_type league_rank = 0; league_rank < lvl_nodes; league_rank++) { @@ -3486,7 +3527,7 @@ tstf); } // end elseif // create a view for the s-th supernocal block column // NOTE: we currently supports only default_layout = LayoutLeft - Kokkos::View viewU(&dataU[i1], nsrow, nscol); @@ -3500,7 +3541,7 @@ tstf); } // end elseif workoffset, workoffset + nsrow)); // needed with gemv for update&scatter - KokkosBlas::gemv("N", one, Uij, Xj, zero, Z); + KokkosBlas::gemv(space, "N", one, Uij, Xj, zero, Z); } else { // extract part of the solution, corresponding to the diagonal // block @@ -3517,14 +3558,14 @@ tstf); } // end elseif workoffset, workoffset + nscol)); // needed for gemv instead of trmv/trsv - KokkosBlas::gemv("N", one, Ujj, Y, zero, Xj); + KokkosBlas::gemv(space, "N", one, Ujj, Y, zero, Xj); } else { // NOTE: we currently supports only default_layout = // LayoutLeft - Kokkos::View Xjj(Xj.data(), nscol, 1); - KokkosBlas::trsm("L", "U", "N", "N", one, Ujj, Xjj); + KokkosBlas::trsm(space, "L", "U", "N", "N", one, Ujj, Xjj); } // update off-diagonal blocks if (nsrow2 > 0) { @@ -3538,7 +3579,7 @@ tstf); } // end elseif workoffset + nscol, workoffset + nscol + nsrow2)); // needed with gemv for update&scatter - KokkosBlas::gemv("N", one, Uij, Xj, zero, Z); + KokkosBlas::gemv(space, "N", one, Uij, Xj, zero, Z); } } } @@ -3548,9 +3589,12 @@ tstf); } // end elseif SparseTriSupernodalSpMVFunctor sptrsv_init_functor(-1, node_count, nodes_grouped_by_level, supercols, work_offset_data, lhs, work); - Kokkos::parallel_for("parfor_tri_supernode_spmv", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_tri_supernode_spmv", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); } } @@ -3562,10 +3606,13 @@ tstf); } // end elseif diag_kernel_type, lhs, work, work_offset, 
nodes_grouped_by_level, node_count); - using policy_type = Kokkos::TeamPolicy; - Kokkos::parallel_for("parfor_usolve_tran_supernode", - policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_functor); + using team_policy_t = Kokkos::TeamPolicy; + Kokkos::parallel_for( + "parfor_usolve_tran_supernode", + Kokkos::Experimental::require( + team_policy_t(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_functor); } else { // U stored in CSR // launching sparse-triangular solve functor UpperTriSupernodalFunctor; - Kokkos::parallel_for("parfor_usolve_supernode", - policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_functor); + using team_policy_t = Kokkos::TeamPolicy; + Kokkos::parallel_for( + "parfor_usolve_supernode", + Kokkos::Experimental::require( + team_policy_t(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_functor); if (diag_kernel_type_host(lvl) == 3) { // using device-level kernels (functor is called to gather the input @@ -3608,7 +3658,7 @@ tstf); } // end elseif // create a view for the s-th supernocal block column // NOTE: we currently supports only default_layout = LayoutLeft - Kokkos::View viewU(&dataU[i1], nsrow, nscol); @@ -3634,7 +3684,7 @@ tstf); } // end elseif workoffset + nscol, workoffset + nscol + nsrow2)); // needed with gemv for update&scatter - KokkosBlas::gemv("T", -one, Uij, Z, one, Xj); + KokkosBlas::gemv(space, "T", -one, Uij, Z, one, Xj); } // "triangular-solve" to compute Xj @@ -3642,13 +3692,13 @@ tstf); } // end elseif auto Ujj = Kokkos::subview(viewU, range_type(0, nscol), Kokkos::ALL()); if (invert_diagonal) { - KokkosBlas::gemv("T", one, Ujj, Xj, zero, Y); + KokkosBlas::gemv(space, "T", one, Ujj, Xj, zero, Y); } else { // NOTE: we currently supports only default_layout = LayoutLeft - Kokkos::View Xjj(Xj.data(), nscol, 1); - KokkosBlas::trsm("L", "L", "T", "N", one, Ujj, Xjj); + KokkosBlas::trsm(space, "L", "L", "T", "N", one, Ujj, Xjj); } } 
if (invert_diagonal) { @@ -3657,9 +3707,12 @@ tstf); } // end elseif SparseTriSupernodalSpMVFunctor sptrsv_init_functor(-1, node_count, nodes_grouped_by_level, supercols, work_offset_data, lhs, work); - Kokkos::parallel_for("parfor_tri_supernode_spmv", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_tri_supernode_spmv", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); } } } @@ -3680,7 +3733,7 @@ tstf); } // end elseif #endif // initialize input & output vectors - using team_policy_type = Kokkos::TeamPolicy; + using team_policy_type = Kokkos::TeamPolicy; // update with one, or two, spmv bool transpose_spmv = @@ -3691,28 +3744,34 @@ tstf); } // end elseif if (!invert_offdiagonal) { // solve with diagonals auto digmat = thandle.get_diagblock(lvl); - KokkosSparse::spmv(tran, one, digmat, lhs, one, work); + KokkosSparse::spmv(space, tran, one, digmat, lhs, one, work); // copy from work to lhs corresponding to diagonal blocks SparseTriSupernodalSpMVFunctor sptrsv_init_functor(-1, node_count, nodes_grouped_by_level, supercols, supercols, lhs, work); - Kokkos::parallel_for("parfor_lsolve_supernode", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_lsolve_supernode", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); } else { // zero out lhs corresponding to diagonal blocks in lhs, and copy to // work SparseTriSupernodalSpMVFunctor sptrsv_init_functor(1, node_count, nodes_grouped_by_level, supercols, supercols, lhs, work); - Kokkos::parallel_for("parfor_lsolve_supernode", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_lsolve_supernode", + Kokkos::Experimental::require( + 
team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); } // update with off-diagonals (potentiall combined with diagonal // solves) auto submat = thandle.get_submatrix(lvl); - KokkosSparse::spmv(tran, one, submat, work, one, lhs); + KokkosSparse::spmv(space, tran, one, submat, work, one, lhs); } else { if (!invert_offdiagonal) { // zero out lhs corresponding to diagonal blocks in lhs, and copy to @@ -3720,17 +3779,20 @@ tstf); } // end elseif SparseTriSupernodalSpMVFunctor sptrsv_init_functor(1, node_count, nodes_grouped_by_level, supercols, supercols, lhs, work); - Kokkos::parallel_for("parfor_lsolve_supernode", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_init_functor); + Kokkos::parallel_for( + "parfor_lsolve_supernode", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_init_functor); // update with off-diagonals auto submat = thandle.get_submatrix(lvl); - KokkosSparse::spmv(tran, one, submat, lhs, one, work); + KokkosSparse::spmv(space, tran, one, submat, lhs, one, work); // solve with diagonals auto digmat = thandle.get_diagblock(lvl); - KokkosSparse::spmv(tran, one, digmat, work, one, lhs); + KokkosSparse::spmv(space, tran, one, digmat, work, one, lhs); } else { std::cout << " ** invert_offdiag with U in CSR not supported **" << std::endl; @@ -3740,9 +3802,12 @@ tstf); } // end elseif SparseTriSupernodalSpMVFunctor sptrsv_finalize_functor(0, node_count, nodes_grouped_by_level, supercols, supercols, lhs, work); - Kokkos::parallel_for("parfor_lsolve_supernode", - team_policy_type(lvl_nodes, Kokkos::AUTO), - sptrsv_finalize_functor); + Kokkos::parallel_for( + "parfor_lsolve_supernode", + Kokkos::Experimental::require( + team_policy_type(space, lvl_nodes, Kokkos::AUTO), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + sptrsv_finalize_functor); #ifdef 
profile_supernodal_etree Kokkos::fence(); @@ -3765,23 +3830,22 @@ tstf); } // end elseif double sptrsv_time_seconds = sptrsv_timer.seconds(); std::cout << " + SpTrsv(uppper) time: " << sptrsv_time_seconds << std::endl << std::endl; - std::cout << " + Execution space : " << execution_space::name() + std::cout << " + Execution space : " << ExecutionSpace::name() << std::endl; std::cout << " + Memory space : " << memory_space::name() << std::endl; #endif } // end upper_tri_solve -template -void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, - const EntriesType entries, const ValuesType values, - const RHSType &rhs, LHSType &lhs, +template +void tri_solve_chain(ExecutionSpace &space, TriSolveHandle &thandle, + const RowMapType row_map, const EntriesType entries, + const ValuesType values, const RHSType &rhs, LHSType &lhs, const bool /*is_lowertri_*/) { #if defined(KOKKOS_ENABLE_CUDA) && defined(KOKKOSPSTRSV_SOLVE_IMPL_PROFILE) cudaProfilerStop(); #endif - typedef typename TriSolveHandle::execution_space execution_space; typedef typename TriSolveHandle::size_type size_type; typedef typename TriSolveHandle::nnz_lno_view_t NGBLType; @@ -3802,9 +3866,9 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, size_type node_count = 0; // REFACTORED to cleanup; next, need debug and timer routines - using policy_type = Kokkos::TeamPolicy; + using policy_type = Kokkos::TeamPolicy; using large_cutoff_policy_type = - Kokkos::TeamPolicy; + Kokkos::TeamPolicy; /* using TP1Functor = TriLvlSchedTP1SolverFunctor; using LTP1Functor = @@ -3865,14 +3929,17 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, #endif if (team_size == -1) { team_size = - policy_type(1, 1, vector_size) + policy_type(space, 1, 1, vector_size) .team_size_recommended(tstf, Kokkos::ParallelForTag()); } size_type lvl_nodes = hnodes_per_level(schain); // lvl == echain???? 
- Kokkos::parallel_for("parfor_l_team_chain1", - policy_type(lvl_nodes, team_size, vector_size), - tstf); + Kokkos::parallel_for( + "parfor_l_team_chain1", + Kokkos::Experimental::require( + policy_type(space, lvl_nodes, team_size, vector_size), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + tstf); node_count += lvl_nodes; } else { @@ -3884,7 +3951,7 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, if (team_size_singleblock <= 0) { team_size_singleblock = - policy_type(1, 1, vector_size) + policy_type(space, 1, 1, vector_size) .team_size_recommended( SingleBlockFunctor(row_map, entries, values, lhs, rhs, nodes_grouped_by_level, @@ -3907,7 +3974,10 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, #endif Kokkos::parallel_for( "parfor_l_team_chainmulti", - policy_type(1, team_size_singleblock, vector_size), tstf); + Kokkos::Experimental::require( + policy_type(space, 1, team_size_singleblock, vector_size), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + tstf); } else { // team_size_singleblock < cutoff => kernel must allow for a // block-stride internally @@ -3925,11 +3995,15 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, #endif Kokkos::parallel_for( "parfor_l_team_chainmulti_cutoff", - large_cutoff_policy_type(1, team_size_singleblock, vector_size), + Kokkos::Experimental::require( + large_cutoff_policy_type(1, team_size_singleblock, + vector_size), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), tstf); } node_count += lvl_nodes; } + // TODO: space.fence() Kokkos::fence(); // TODO - is this necessary? that is, can the // parallel_for launch before the s/echain values have // been updated? 
@@ -3955,16 +4029,19 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, #endif if (team_size == -1) { team_size = - policy_type(1, 1, vector_size) + policy_type(space, 1, 1, vector_size) .team_size_recommended(tstf, Kokkos::ParallelForTag()); } // TODO To use cudagraph here, need to know how many non-unit chains // there are, create a graph for each and launch accordingly size_type lvl_nodes = hnodes_per_level(schain); // lvl == echain???? - Kokkos::parallel_for("parfor_u_team_chain1", - policy_type(lvl_nodes, team_size, vector_size), - tstf); + Kokkos::parallel_for( + "parfor_u_team_chain1", + Kokkos::Experimental::require( + policy_type(space, lvl_nodes, team_size, vector_size), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + tstf); node_count += lvl_nodes; } else { @@ -3980,7 +4057,7 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, // values, lhs, rhs, nodes_grouped_by_level, is_lowertri, node_count), // Kokkos::ParallelForTag()); team_size_singleblock = - policy_type(1, 1, vector_size) + policy_type(space, 1, 1, vector_size) .team_size_recommended( SingleBlockFunctor(row_map, entries, values, lhs, rhs, nodes_grouped_by_level, @@ -4003,7 +4080,10 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, #endif Kokkos::parallel_for( "parfor_u_team_chainmulti", - policy_type(1, team_size_singleblock, vector_size), tstf); + Kokkos::Experimental::require( + policy_type(space, 1, team_size_singleblock, vector_size), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), + tstf); } else { // team_size_singleblock < cutoff => kernel must allow for a // block-stride internally @@ -4021,11 +4101,15 @@ void tri_solve_chain(TriSolveHandle &thandle, const RowMapType row_map, #endif Kokkos::parallel_for( "parfor_u_team_chainmulti_cutoff", - large_cutoff_policy_type(1, team_size_singleblock, vector_size), + Kokkos::Experimental::require( + large_cutoff_policy_type(1, team_size_singleblock, + 
vector_size), + Kokkos::Experimental::WorkItemProperty::HintLightWeight), tstf); } node_count += lvl_nodes; } + // TODO: space.fence() Kokkos::fence(); // TODO - is this necessary? that is, can the // parallel_for launch before the s/echain values have // been updated? diff --git a/sparse/impl/KokkosSparse_sptrsv_solve_spec.hpp b/sparse/impl/KokkosSparse_sptrsv_solve_spec.hpp index e36b9df236..6ad321c286 100644 --- a/sparse/impl/KokkosSparse_sptrsv_solve_spec.hpp +++ b/sparse/impl/KokkosSparse_sptrsv_solve_spec.hpp @@ -96,9 +96,9 @@ template ::value> struct SPTRSV_SOLVE { - static void sptrsv_solve(KernelHandle *handle, const RowMapType row_map, - const EntriesType entries, const ValuesType values, - BType b, XType x); + static void sptrsv_solve(ExecutionSpace &space, KernelHandle *handle, + const RowMapType row_map, const EntriesType entries, + const ValuesType values, BType b, XType x); static void sptrsv_solve_streams( const std::vector &execspace_v, @@ -117,9 +117,9 @@ template { - static void sptrsv_solve(KernelHandle *handle, const RowMapType row_map, - const EntriesType entries, const ValuesType values, - BType b, XType x) { + static void sptrsv_solve(ExecutionSpace &space, KernelHandle *handle, + const RowMapType row_map, const EntriesType entries, + const ValuesType values, BType b, XType x) { // Call specific algorithm type auto sptrsv_handle = handle->get_sptrsv_handle(); Kokkos::Profiling::pushRegion(sptrsv_handle->is_lower_tri() @@ -127,40 +127,44 @@ struct SPTRSV_SOLVEis_lower_tri()) { if (sptrsv_handle->is_symbolic_complete() == false) { - Experimental::lower_tri_symbolic(*sptrsv_handle, row_map, entries); + Experimental::lower_tri_symbolic(space, *sptrsv_handle, row_map, + entries); } if (sptrsv_handle->get_algorithm() == KokkosSparse::Experimental::SPTRSVAlgorithm::SEQLVLSCHD_TP1CHAIN) { - Experimental::tri_solve_chain(*sptrsv_handle, row_map, entries, values, - b, x, true); + Experimental::tri_solve_chain(space, *sptrsv_handle, row_map, entries, + 
values, b, x, true); } else { #ifdef KOKKOSKERNELS_SPTRSV_CUDAGRAPHSUPPORT using ExecSpace = typename RowMapType::memory_space::execution_space; if (std::is_same::value) + // TODO: set stream in thandle's sptrsvCudaGraph Experimental::lower_tri_solve_cg(*sptrsv_handle, row_map, entries, values, b, x); else #endif - Experimental::lower_tri_solve(*sptrsv_handle, row_map, entries, + Experimental::lower_tri_solve(space, *sptrsv_handle, row_map, entries, values, b, x); } } else { if (sptrsv_handle->is_symbolic_complete() == false) { - Experimental::upper_tri_symbolic(*sptrsv_handle, row_map, entries); + Experimental::upper_tri_symbolic(space, *sptrsv_handle, row_map, + entries); } if (sptrsv_handle->get_algorithm() == KokkosSparse::Experimental::SPTRSVAlgorithm::SEQLVLSCHD_TP1CHAIN) { - Experimental::tri_solve_chain(*sptrsv_handle, row_map, entries, values, - b, x, false); + Experimental::tri_solve_chain(space, *sptrsv_handle, row_map, entries, + values, b, x, false); } else { #ifdef KOKKOSKERNELS_SPTRSV_CUDAGRAPHSUPPORT using ExecSpace = typename RowMapType::memory_space::execution_space; if (std::is_same::value) + // TODO: set stream in thandle's sptrsvCudaGraph Experimental::upper_tri_solve_cg(*sptrsv_handle, row_map, entries, values, b, x); else #endif - Experimental::upper_tri_solve(*sptrsv_handle, row_map, entries, + Experimental::upper_tri_solve(space, *sptrsv_handle, row_map, entries, values, b, x); } } @@ -188,7 +192,8 @@ struct SPTRSV_SOLVEis_lower_tri()) { for (int i = 0; i < static_cast(execspace_v.size()); i++) { if (sptrsv_handle_v[i]->is_symbolic_complete() == false) { - Experimental::lower_tri_symbolic(*(sptrsv_handle_v[i]), row_map_v[i], + Experimental::lower_tri_symbolic(execspace_v[i], + *(sptrsv_handle_v[i]), row_map_v[i], entries_v[i]); } } @@ -198,7 +203,8 @@ struct SPTRSV_SOLVE(execspace_v.size()); i++) { if (sptrsv_handle_v[i]->is_symbolic_complete() == false) { - Experimental::upper_tri_symbolic(*(sptrsv_handle_v[i]), row_map_v[i], + 
Experimental::upper_tri_symbolic(execspace_v[i], + *(sptrsv_handle_v[i]), row_map_v[i], entries_v[i]); } } diff --git a/sparse/impl/KokkosSparse_sptrsv_symbolic_impl.hpp b/sparse/impl/KokkosSparse_sptrsv_symbolic_impl.hpp index 3ef3be8780..36ea2d9df8 100644 --- a/sparse/impl/KokkosSparse_sptrsv_symbolic_impl.hpp +++ b/sparse/impl/KokkosSparse_sptrsv_symbolic_impl.hpp @@ -147,9 +147,10 @@ void symbolic_chain_phase(TriSolveHandle& thandle, #endif } // end symbolic_chain_phase -template -void lower_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, - const EntriesType dentries) { +template +void lower_tri_symbolic(ExecSpaceIn& space, TriSolveHandle& thandle, + const RowMapType drow_map, const EntriesType dentries) { #ifdef TRISOLVE_SYMB_TIMERS Kokkos::Timer timer_sym_lowertri_total; Kokkos::Timer timer; @@ -177,10 +178,10 @@ void lower_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, size_type nrows = drow_map.extent(0) - 1; auto row_map = Kokkos::create_mirror_view(drow_map); - Kokkos::deep_copy(row_map, drow_map); + Kokkos::deep_copy(space, row_map, drow_map); auto entries = Kokkos::create_mirror_view(dentries); - Kokkos::deep_copy(entries, dentries); + Kokkos::deep_copy(space, entries, dentries); // get device view - will deep_copy to it at end of this host routine DeviceEntriesType dnodes_per_level = thandle.get_nodes_per_level(); @@ -193,11 +194,12 @@ void lower_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, DeviceSignedEntriesType dlevel_list = thandle.get_level_list(); HostSignedEntriesType level_list = Kokkos::create_mirror_view(dlevel_list); - Kokkos::deep_copy(level_list, dlevel_list); + Kokkos::deep_copy(space, level_list, dlevel_list); signed_integral_t level = 0; size_type node_count = 0; + space.fence(); // wait for deep copy write to land typename DeviceEntriesType::HostMirror level_ptr( "lp", nrows + 1); // temp View used for index bookkeeping level_ptr(0) = 0; @@ -227,9 +229,9 @@ void 
lower_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, // Create the chain now if (thandle.algm_requires_symb_chain()) { + // No need to pass in space, chain phase runs on the host symbolic_chain_phase(thandle, nodes_per_level); } - thandle.set_symbolic_complete(); // Output check @@ -257,9 +259,9 @@ void lower_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, #endif // Deep copy to device views - Kokkos::deep_copy(dnodes_grouped_by_level, nodes_grouped_by_level); - Kokkos::deep_copy(dnodes_per_level, nodes_per_level); - Kokkos::deep_copy(dlevel_list, level_list); + Kokkos::deep_copy(space, dnodes_grouped_by_level, nodes_grouped_by_level); + Kokkos::deep_copy(space, dnodes_per_level, nodes_per_level); + Kokkos::deep_copy(space, dlevel_list, level_list); // Extra check: #ifdef LVL_OUTPUT_INFO @@ -279,6 +281,7 @@ void lower_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, check_count); std::cout << " host check_count= " << check_count << std::endl; + space.fence(); // wait for deep copy writes to land check_count = 0; // reset Kokkos::parallel_reduce( "check_count device", @@ -568,20 +571,21 @@ void lower_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, thandle.set_workspace_size(max_lwork); // workspace offset initialized to be zero integer_view_t work_offset = thandle.get_work_offset(); - Kokkos::deep_copy(work_offset, work_offset_host); + Kokkos::deep_copy(space, work_offset, work_offset_host); // kernel types // > off-diagonal integer_view_t dkernel_type_by_level = thandle.get_kernel_type(); - Kokkos::deep_copy(dkernel_type_by_level, kernel_type_by_level); + Kokkos::deep_copy(space, dkernel_type_by_level, kernel_type_by_level); // > diagonal integer_view_t ddiag_kernel_type_by_level = thandle.get_diag_kernel_type(); - Kokkos::deep_copy(ddiag_kernel_type_by_level, diag_kernel_type_by_level); + Kokkos::deep_copy(space, ddiag_kernel_type_by_level, + diag_kernel_type_by_level); // deep copy to device (of 
scheduling info) - Kokkos::deep_copy(dnodes_grouped_by_level, nodes_grouped_by_level); - Kokkos::deep_copy(dnodes_per_level, nodes_per_level); - Kokkos::deep_copy(dlevel_list, level_list); + Kokkos::deep_copy(space, dnodes_grouped_by_level, nodes_grouped_by_level); + Kokkos::deep_copy(space, dnodes_per_level, nodes_per_level); + Kokkos::deep_copy(space, dlevel_list, level_list); #ifdef TRISOLVE_SYMB_TIMERS std::cout << " + workspace time = " << timer.seconds() << std::endl; @@ -598,9 +602,10 @@ void lower_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, #endif } // end lower_tri_symbolic -template -void upper_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, - const EntriesType dentries) { +template +void upper_tri_symbolic(ExecutionSpace& space, TriSolveHandle& thandle, + const RowMapType drow_map, const EntriesType dentries) { #ifdef TRISOLVE_SYMB_TIMERS Kokkos::Timer timer_sym_uppertri_total; Kokkos::Timer timer; @@ -626,10 +631,10 @@ void upper_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, size_type nrows = drow_map.extent(0) - 1; auto row_map = Kokkos::create_mirror_view(drow_map); - Kokkos::deep_copy(row_map, drow_map); + Kokkos::deep_copy(space, row_map, drow_map); auto entries = Kokkos::create_mirror_view(dentries); - Kokkos::deep_copy(entries, dentries); + Kokkos::deep_copy(space, entries, dentries); // get device view - will deep_copy to it at end of this host routine DeviceEntriesType dnodes_per_level = thandle.get_nodes_per_level(); @@ -642,11 +647,12 @@ void upper_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, DeviceSignedEntriesType dlevel_list = thandle.get_level_list(); HostSignedEntriesType level_list = Kokkos::create_mirror_view(dlevel_list); - Kokkos::deep_copy(level_list, dlevel_list); + Kokkos::deep_copy(space, level_list, dlevel_list); signed_integral_t level = 0; size_type node_count = 0; + space.fence(); // Wait for deep copy writes to land typename 
DeviceEntriesType::HostMirror level_ptr( "lp", nrows + 1); // temp View used for index bookkeeping level_ptr(0) = 0; @@ -708,9 +714,9 @@ void upper_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, #endif // Deep copy to device views - Kokkos::deep_copy(dnodes_grouped_by_level, nodes_grouped_by_level); - Kokkos::deep_copy(dnodes_per_level, nodes_per_level); - Kokkos::deep_copy(dlevel_list, level_list); + Kokkos::deep_copy(space, dnodes_grouped_by_level, nodes_grouped_by_level); + Kokkos::deep_copy(space, dnodes_per_level, nodes_per_level); + Kokkos::deep_copy(space, dlevel_list, level_list); // Extra check: #ifdef LVL_OUTPUT_INFO @@ -730,6 +736,7 @@ void upper_tri_symbolic(TriSolveHandle& thandle, const RowMapType drow_map, check_count); std::cout << " host check_count= " << check_count << std::endl; + space.fence(); // wait for deep copy writes to land check_count = 0; // reset Kokkos::parallel_reduce( "check_count device", diff --git a/sparse/impl/KokkosSparse_sptrsv_symbolic_spec.hpp b/sparse/impl/KokkosSparse_sptrsv_symbolic_spec.hpp index 73389d10d0..5b9304356d 100644 --- a/sparse/impl/KokkosSparse_sptrsv_symbolic_spec.hpp +++ b/sparse/impl/KokkosSparse_sptrsv_symbolic_spec.hpp @@ -67,33 +67,37 @@ namespace Impl { // Unification layer /// \brief Implementation of KokkosSparse::sptrsv_symbolic -template ::value, bool eti_spec_avail = sptrsv_symbolic_eti_spec_avail< KernelHandle, RowMapType, EntriesType>::value> struct SPTRSV_SYMBOLIC { - static void sptrsv_symbolic(KernelHandle *handle, const RowMapType row_map, + static void sptrsv_symbolic(const ExecutionSpace &space, KernelHandle *handle, + const RowMapType row_map, const EntriesType entries); }; #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY //! 
Full specialization of sptrsv_symbolic // Unification layer -template -struct SPTRSV_SYMBOLIC { - static void sptrsv_symbolic(KernelHandle *handle, const RowMapType row_map, +template +struct SPTRSV_SYMBOLIC { + static void sptrsv_symbolic(const ExecutionSpace &space, KernelHandle *handle, + const RowMapType row_map, const EntriesType entries) { auto sptrsv_handle = handle->get_sptrsv_handle(); auto nrows = row_map.extent(0) - 1; sptrsv_handle->new_init_handle(nrows); if (sptrsv_handle->is_lower_tri()) { - Experimental::lower_tri_symbolic(*sptrsv_handle, row_map, entries); + Experimental::lower_tri_symbolic(space, *sptrsv_handle, row_map, entries); sptrsv_handle->set_symbolic_complete(); } else { - Experimental::upper_tri_symbolic(*sptrsv_handle, row_map, entries); + Experimental::upper_tri_symbolic(space, *sptrsv_handle, row_map, entries); sptrsv_handle->set_symbolic_complete(); } } @@ -113,6 +117,7 @@ struct SPTRSV_SYMBOLIC, \ @@ -130,6 +135,7 @@ struct SPTRSV_SYMBOLIC, \ diff --git a/sparse/impl/KokkosSparse_trsv_impl.hpp b/sparse/impl/KokkosSparse_trsv_impl.hpp index fbbd547e34..9adb029d12 100644 --- a/sparse/impl/KokkosSparse_trsv_impl.hpp +++ b/sparse/impl/KokkosSparse_trsv_impl.hpp @@ -14,15 +14,20 @@ // //@HEADER -#ifndef KOKKOSSPARSE_IMPL_TRSM_HPP_ -#define KOKKOSSPARSE_IMPL_TRSM_HPP_ +#ifndef KOKKOSSPARSE_TRSV_IMPL_HPP_ +#define KOKKOSSPARSE_TRSV_IMPL_HPP_ -/// \file KokkosSparse_impl_trsm.hpp -/// \brief Implementation(s) of sparse triangular solve. +/// \file KokkosSparse_trsv_impl.hpp +/// \brief Implementation(s) of sequential sparse triangular solve. 
#include #include -#include // temporarily +#include "KokkosBatched_Axpy.hpp" +#include "KokkosBatched_Gemm_Decl.hpp" +#include "KokkosBatched_Gemm_Serial_Impl.hpp" +#include "KokkosBatched_Gesv.hpp" +#include "KokkosBlas2_gemv.hpp" +#include "KokkosBlas1_set.hpp" namespace KokkosSparse { namespace Impl { @@ -30,652 +35,694 @@ namespace Sequential { template -void lowerTriSolveCsrUnitDiag(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - - const local_ordinal_type numRows = A.numRows(); - // const local_ordinal_type numCols = A.numCols (); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type r = 0; r < numRows; ++r) { - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = Y(r, j); - } - const offset_type beg = ptr(r); - const offset_type end = ptr(r + 1); - for (offset_type k = beg; k < end; ++k) { - const matrix_scalar_type A_rc = val(k); - const local_ordinal_type c = ind(k); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); +struct TrsvWrap { + using offset_type = + typename CrsMatrixType::row_map_type::non_const_value_type; + using lno_t = typename CrsMatrixType::index_type::non_const_value_type; + using scalar_t = typename CrsMatrixType::values_type::non_const_value_type; + using device_t = typename CrsMatrixType::device_type; + using sview_1d = typename Kokkos::View; + using STS = Kokkos::ArithTraits; + + static inline void manual_copy(RangeMultiVectorType X, + DomainMultiVectorType Y) { + auto numRows = X.extent(0); + auto numVecs = 
X.extent(1); + for (decltype(numRows) i = 0; i < numRows; ++i) { + for (decltype(numVecs) j = 0; j < numVecs; ++j) { + X(i, j) = Y(i, j); } - } // for each entry A_rc in the current row r - } // for each row r -} - -template -void lowerTriSolveCsr(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - typedef Kokkos::ArithTraits STS; - - const local_ordinal_type numRows = A.numRows(); - // const local_ordinal_type numCols = A.numCols (); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type r = 0; r < numRows; ++r) { - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = Y(r, j); } + } - matrix_scalar_type A_rr = STS::zero(); - const offset_type beg = ptr(r); - const offset_type end = ptr(r + 1); - - for (offset_type k = beg; k < end; ++k) { - const matrix_scalar_type A_rc = val(k); - const local_ordinal_type c = ind(k); - // FIXME (mfh 28 Aug 2014) This assumes that the diagonal entry - // has equal local row and column indices. That may not - // necessarily hold, depending on the row and column Maps. The - // way to fix this would be for Tpetra::CrsMatrix to remember - // the local column index of the diagonal entry (if there is - // one) in each row, and pass that along to this function. 
- if (r == c) { - A_rr += A_rc; - } else { - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); - } - } - } // for each entry A_rc in the current row r - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = X(r, j) / A_rr; + struct CommonUnblocked { + CommonUnblocked(const lno_t block_size) { + KK_REQUIRE_MSG(block_size == 1, + "Tried to use block_size>1 for non-block-enabled Common"); } - } // for each row r -} -template -void upperTriSolveCsrUnitDiag(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - - const local_ordinal_type numRows = A.numRows(); - // const local_ordinal_type numCols = A.numCols (); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - // If local_ordinal_type is unsigned and numRows is 0, the loop - // below will have entirely the wrong number of iterations. - if (numRows == 0) { - return; - } + scalar_t zero() { return STS::zero(); } - // Don't use r >= 0 as the test, because that fails if - // local_ordinal_type is unsigned. We do r == 0 (last - // iteration) below. 
- for (local_ordinal_type r = numRows - 1; r != 0; --r) { - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = Y(r, j); - } - const offset_type beg = ptr(r); - const offset_type end = ptr(r + 1); - for (offset_type k = beg; k < end; ++k) { - const matrix_scalar_type A_rc = val(k); - const local_ordinal_type c = ind(k); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); - } - } // for each entry A_rc in the current row r - } // for each row r - - // Last iteration: r = 0. - { - const local_ordinal_type r = 0; - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = Y(r, j); + template + scalar_t get(const ValuesView& vals, const offset_type i) { + return vals(i); } - const offset_type beg = ptr(r); - const offset_type end = ptr(r + 1); - for (offset_type k = beg; k < end; ++k) { - const matrix_scalar_type A_rc = val(k); - const local_ordinal_type c = ind(k); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); - } - } // for each entry A_rc in the current row r - } // last iteration: r = 0 -} -template -void upperTriSolveCsr(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - - const local_ordinal_type numRows = A.numRows(); - // const local_ordinal_type numCols = A.numCols (); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - typedef Kokkos::ArithTraits STS; - - // If local_ordinal_type is unsigned and numRows is 0, the loop - // below will have entirely the wrong number of iterations. 
- if (numRows == 0) { - return; - } + void pluseq(scalar_t& lhs, const scalar_t& rhs) { lhs += rhs; } - // Don't use r >= 0 as the test, because that fails if - // local_ordinal_type is unsigned. We do r == 0 (last - // iteration) below. - for (local_ordinal_type r = numRows - 1; r != 0; --r) { - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = Y(r, j); + void gemv(RangeMultiVectorType X, const scalar_t& A, const lno_t r, + const lno_t c, const lno_t j, const char = 'N') { + X(r, j) -= A * X(c, j); } - const offset_type beg = ptr(r); - const offset_type end = ptr(r + 1); - matrix_scalar_type A_rr = STS::zero(); - for (offset_type k = beg; k < end; ++k) { - const matrix_scalar_type A_rc = val(k); - const local_ordinal_type c = ind(k); - if (r == c) { - A_rr += A_rc; - } else { - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); - } - } - } // for each entry A_rc in the current row r - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = X(r, j) / A_rr; + + template + void divide(RangeMultiVectorType X, const scalar_t& A, const lno_t r, + const lno_t j) { + X(r, j) /= A; + } + }; + + struct CommonBlocked { + // BSR data is in LayoutRight! 
+ using Layout = Kokkos::LayoutRight; + + using UBlock = Kokkos::View< + scalar_t**, Layout, typename CrsMatrixType::device_type, + Kokkos::MemoryTraits >; + + using Block = + Kokkos::View >; + + using Vector = Kokkos::View >; + + using UVector = Kokkos::View< + scalar_t*, typename CrsMatrixType::device_type, + Kokkos::MemoryTraits >; + + lno_t m_block_size; + lno_t m_block_items; + Vector m_ones; + Block m_data; + Block m_tmp; // Needed for SerialGesv + UBlock m_utmp; // Needed for SerialGesv + Vector m_vec_data1; + Vector m_vec_data2; + + CommonBlocked(const lno_t block_size) + : m_block_size(block_size), + m_block_items(block_size * block_size), + m_ones("ones", block_size), + m_data("m_data", block_size, block_size), + m_tmp("m_tmp", block_size, block_size + 4), + m_utmp(m_tmp.data(), block_size, block_size + 4), + m_vec_data1("m_vec_data1", block_size), + m_vec_data2("m_vec_data2", block_size) { + Kokkos::deep_copy(m_ones, 1.0); } - } // for each row r - // Last iteration: r = 0. - { - const local_ordinal_type r = 0; - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = Y(r, j); + UBlock zero() { + UBlock block(m_data.data(), m_block_size, m_block_size); + KokkosBlas::SerialSet::invoke(STS::zero(), block); + return block; } - const offset_type beg = ptr(r); - const offset_type end = ptr(r + 1); - matrix_scalar_type A_rr = STS::zero(); - for (offset_type k = beg; k < end; ++k) { - const matrix_scalar_type A_rc = val(k); - const local_ordinal_type c = ind(k); - if (r == c) - A_rr += A_rc; - else { - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); - } - } - } // for each entry A_rc in the current row r - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) = X(r, j) / A_rr; + + template + UBlock get(const ValuesView& vals, const offset_type i) { + scalar_t* data = const_cast(vals.data()); + UBlock rv(data + (i * m_block_items), m_block_size, m_block_size); + return rv; } - } // last iteration: r = 0 -} -template 
-void upperTriSolveCscUnitDiag(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - - const local_ordinal_type numRows = A.numRows(); - const local_ordinal_type numCols = A.numCols(); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type j = 0; j < numVecs; ++j) { - for (local_ordinal_type i = 0; i < numRows; ++i) { - X(i, j) = Y(i, j); + void pluseq(UBlock& lhs, const UBlock& rhs) { + KokkosBatched::SerialAxpy::invoke(m_ones, rhs, lhs); } - } - // If local_ordinal_type is unsigned and numCols is 0, the loop - // below will have entirely the wrong number of iterations. - if (numCols == 0) { - return; - } + void gemv(RangeMultiVectorType X, const UBlock& A, const lno_t r, + const lno_t c, const lno_t j, const char transpose = 'N') { + // Create and populate x and y + UVector x(m_vec_data1.data(), m_block_size); + UVector y(m_vec_data2.data(), m_block_size); + for (lno_t b = 0; b < m_block_size; ++b) { + x(b) = X(c * m_block_size + b, j); + y(b) = X(r * m_block_size + b, j); + } + + KokkosBlas::Experimental::serial_gemv(transpose, -1, A, x, 1, y); - // Don't use c >= 0 as the test, because that fails if - // local_ordinal_type is unsigned. We do c == 0 (last - // iteration) below. 
- for (local_ordinal_type c = numCols - 1; c != 0; --c) { - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = beg; k < end; ++k) { - const matrix_scalar_type A_rc = val(k); - const local_ordinal_type r = ind(k); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); + for (lno_t b = 0; b < m_block_size; ++b) { + X(r * m_block_size + b, j) = y(b); } - } // for each entry A_rc in the current column c - } // for each column c - - // Last iteration: c = 0. - { - const local_ordinal_type c = 0; - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = beg; k < end; ++k) { - const matrix_scalar_type A_rc = val(k); - const local_ordinal_type r = ind(k); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); + } + + template + void divide(RangeMultiVectorType X, const UBlock& A, const lno_t r, + const lno_t j) { + UVector x(m_vec_data1.data(), m_block_size); + UVector y(m_vec_data2.data(), m_block_size); + for (lno_t b = 0; b < m_block_size; ++b) { + y(b) = X(r * m_block_size + b, j); } - } // for each entry A_rc in the current column c - } -} -template -void upperTriSolveCsc(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - - const local_ordinal_type numRows = A.numRows(); - const local_ordinal_type numCols = A.numCols(); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type j = 0; j < numVecs; ++j) { - for (local_ordinal_type i = 0; i < numRows; ++i) { - 
X(i, j) = Y(i, j); - } - } + // if StaticPivoting is used, there are compiler errors related to + // comparing complex and non-complex. + using Algo = KokkosBatched::Gesv::NoPivoting; + + KokkosBatched::SerialGesv::invoke(A, x, y, m_utmp); - // If local_ordinal_type is unsigned and numCols is 0, the loop - // below will have entirely the wrong number of iterations. - if (numCols == 0) { - return; + for (lno_t b = 0; b < m_block_size; ++b) { + X(r * m_block_size + b, j) = x(b); + } + } + }; + + using CommonOps = std::conditional_t< + KokkosSparse::Experimental::is_bsr_matrix::value, + CommonBlocked, CommonUnblocked>; + + static void lowerTriSolveCsrUnitDiag(RangeMultiVectorType X, + const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + for (lno_t r = 0; r < numRows; ++r) { + const offset_type beg = ptr(r); + const offset_type end = ptr(r + 1); + for (offset_type k = beg; k < end; ++k) { + const scalar_t A_rc = val(k); + const lno_t c = ind(k); + for (lno_t j = 0; j < numVecs; ++j) { + X(r, j) -= A_rc * X(c, j); + } + } // for each entry A_rc in the current row r + } // for each row r } - // Don't use c >= 0 as the test, because that fails if - // local_ordinal_type is unsigned. We do c == 0 (last - // iteration) below. 
- for (local_ordinal_type c = numCols - 1; c != 0; --c) { - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = end - 1; k >= beg; --k) { - const local_ordinal_type r = ind(k); - const matrix_scalar_type A_rc = val(k); - /*(vqd 20 Jul 2020) This assumes that the diagonal entry - has equal local row and column indices. That may not - necessarily hold, depending on the row and column Maps. See - note above.*/ - for (local_ordinal_type j = 0; j < numVecs; ++j) { + static void lowerTriSolveCsr(RangeMultiVectorType X, const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + CommonOps co(block_size); + + manual_copy(X, Y); + + for (lno_t r = 0; r < numRows; ++r) { + auto A_rr = co.zero(); + const offset_type beg = ptr(r); + const offset_type end = ptr(r + 1); + + for (offset_type k = beg; k < end; ++k) { + const auto A_rc = co.get(val, k); + const lno_t c = ind(k); + // FIXME (mfh 28 Aug 2014) This assumes that the diagonal entry + // has equal local row and column indices. That may not + // necessarily hold, depending on the row and column Maps. The + // way to fix this would be for Tpetra::CrsMatrix to remember + // the local column index of the diagonal entry (if there is + // one) in each row, and pass that along to this function. 
if (r == c) { - X(c, j) = X(c, j) / A_rc; + co.pluseq(A_rr, A_rc); } else { - X(r, j) -= A_rc * X(c, j); + for (lno_t j = 0; j < numVecs; ++j) { + co.gemv(X, A_rc, r, c, j); + } } + } // for each entry A_rc in the current row r + for (lno_t j = 0; j < numVecs; ++j) { + co.template divide(X, A_rr, r, j); } - } // for each entry A_rc in the current column c - } // for each column c - - // Last iteration: c = 0. - { - const offset_type beg = ptr(0); - const matrix_scalar_type A_rc = val(beg); - /*(vqd 20 Jul 2020) This assumes that the diagonal entry - has equal local row and column indices. That may not - necessarily hold, depending on the row and column Maps. See - note above.*/ - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(0, j) = X(0, j) / A_rc; - } + } // for each row r } -} -template -void lowerTriSolveCscUnitDiag(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - - const local_ordinal_type numRows = A.numRows(); - const local_ordinal_type numCols = A.numCols(); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type j = 0; j < numVecs; ++j) { - for (local_ordinal_type i = 0; i < numRows; ++i) { - X(i, j) = Y(i, j); + static void upperTriSolveCsrUnitDiag(RangeMultiVectorType X, + const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename 
CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + // If lno_t is unsigned and numRows is 0, the loop + // below will have entirely the wrong number of iterations. + if (numRows == 0) { + return; } + + // Don't use r >= 0 as the test, because that fails if + // lno_t is unsigned. We do r == 0 (last + // iteration) below. + for (lno_t r = numRows - 1; r != 0; --r) { + const offset_type beg = ptr(r); + const offset_type end = ptr(r + 1); + for (offset_type k = beg; k < end; ++k) { + const scalar_t A_rc = val(k); + const lno_t c = ind(k); + for (lno_t j = 0; j < numVecs; ++j) { + X(r, j) -= A_rc * X(c, j); + } + } // for each entry A_rc in the current row r + } // for each row r + + // Last iteration: r = 0. + { + const lno_t r = 0; + const offset_type beg = ptr(r); + const offset_type end = ptr(r + 1); + for (offset_type k = beg; k < end; ++k) { + const scalar_t A_rc = val(k); + const lno_t c = ind(k); + for (lno_t j = 0; j < numVecs; ++j) { + X(r, j) -= A_rc * X(c, j); + } + } // for each entry A_rc in the current row r + } // last iteration: r = 0 } - for (local_ordinal_type c = 0; c < numCols; ++c) { - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = beg; k < end; ++k) { - const local_ordinal_type r = ind(k); - const matrix_scalar_type A_rc = val(k); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); - } - } // for each entry A_rc in the current column c - } // for each column c -} + static void upperTriSolveCsr(RangeMultiVectorType X, const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename 
CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; -template -void upperTriSolveCscUnitDiagConj(RangeMultiVectorType X, - const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - typedef Kokkos::ArithTraits STS; - - const local_ordinal_type numRows = A.numRows(); - const local_ordinal_type numCols = A.numCols(); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type j = 0; j < numVecs; ++j) { - for (local_ordinal_type i = 0; i < numRows; ++i) { - X(i, j) = Y(i, j); - } - } + CommonOps co(block_size); - // If local_ordinal_type is unsigned and numCols is 0, the loop - // below will have entirely the wrong number of iterations. - if (numCols == 0) { - return; - } + manual_copy(X, Y); - // Don't use c >= 0 as the test, because that fails if - // local_ordinal_type is unsigned. We do c == 0 (last - // iteration) below. - for (local_ordinal_type c = numCols - 1; c != 0; --c) { - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = beg; k < end; ++k) { - const local_ordinal_type r = ind(k); - const matrix_scalar_type A_rc = STS::conj(val(k)); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); + // If lno_t is unsigned and numRows is 0, the loop + // below will have entirely the wrong number of iterations. + if (numRows == 0) { + return; + } + + // Don't use r >= 0 as the test, because that fails if + // lno_t is unsigned. We do r == 0 (last + // iteration) below. 
+ for (lno_t r = numRows - 1; r != 0; --r) { + const offset_type beg = ptr(r); + const offset_type end = ptr(r + 1); + auto A_rr = co.zero(); + for (offset_type k = beg; k < end; ++k) { + const auto A_rc = co.get(val, k); + const lno_t c = ind(k); + if (r == c) { + co.pluseq(A_rr, A_rc); + } else { + for (lno_t j = 0; j < numVecs; ++j) { + co.gemv(X, A_rc, r, c, j); + } + } + } // for each entry A_rc in the current row r + for (lno_t j = 0; j < numVecs; ++j) { + co.template divide(X, A_rr, r, j); } - } // for each entry A_rc in the current column c - } // for each column c - - // Last iteration: c = 0. - { - const local_ordinal_type c = 0; - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = beg; k < end; ++k) { - const local_ordinal_type r = ind(k); - const matrix_scalar_type A_rc = STS::conj(val(k)); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); + } // for each row r + + // Last iteration: r = 0. + { + const lno_t r = 0; + const offset_type beg = ptr(r); + const offset_type end = ptr(r + 1); + auto A_rr = co.zero(); + for (offset_type k = beg; k < end; ++k) { + const auto A_rc = co.get(val, k); + const lno_t c = ind(k); + if (r == c) { + co.pluseq(A_rr, A_rc); + } else { + for (lno_t j = 0; j < numVecs; ++j) { + co.gemv(X, A_rc, r, c, j); + } + } + } // for each entry A_rc in the current row r + for (lno_t j = 0; j < numVecs; ++j) { + co.template divide(X, A_rr, r, j); } - } // for each entry A_rc in the current column c + } // last iteration: r = 0 } -} -template -void upperTriSolveCscConj(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - typedef Kokkos::ArithTraits STS; - - const local_ordinal_type 
numRows = A.numRows(); - const local_ordinal_type numCols = A.numCols(); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type j = 0; j < numVecs; ++j) { - for (local_ordinal_type i = 0; i < numRows; ++i) { - X(i, j) = Y(i, j); + static void upperTriSolveCscUnitDiag(RangeMultiVectorType X, + const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numCols = A.numCols(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + // If lno_t is unsigned and numCols is 0, the loop + // below will have entirely the wrong number of iterations. + if (numCols == 0) { + return; } - } - - // If local_ordinal_type is unsigned and numCols is 0, the loop - // below will have entirely the wrong number of iterations. - if (numCols == 0) { - return; - } - // Don't use c >= 0 as the test, because that fails if - // local_ordinal_type is unsigned. We do c == 0 (last - // iteration) below. - for (local_ordinal_type c = numCols - 1; c != 0; --c) { - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = end - 1; k >= beg; --k) { - const local_ordinal_type r = ind(k); - const matrix_scalar_type A_rc = STS::conj(val(k)); - /*(vqd 20 Jul 2020) This assumes that the diagonal entry - has equal local row and column indices. That may not - necessarily hold, depending on the row and column Maps. 
See - note above.*/ - for (local_ordinal_type j = 0; j < numVecs; ++j) { - if (r == c) { - X(c, j) = X(c, j) / A_rc; - } else { + // Don't use c >= 0 as the test, because that fails if + // lno_t is unsigned. We do c == 0 (last + // iteration) below. + for (lno_t c = numCols - 1; c != 0; --c) { + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = beg; k < end; ++k) { + const scalar_t A_rc = val(k); + const lno_t r = ind(k); + for (lno_t j = 0; j < numVecs; ++j) { X(r, j) -= A_rc * X(c, j); } - } - } // for each entry A_rc in the current column c - } // for each column c - - // Last iteration: c = 0. - { - const offset_type beg = ptr(0); - const matrix_scalar_type A_rc = STS::conj(val(beg)); - /*(vqd 20 Jul 2020) This assumes that the diagonal entry - has equal local row and column indices. That may not - necessarily hold, depending on the row and column Maps. See - note above.*/ - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(0, j) = X(0, j) / A_rc; + } // for each entry A_rc in the current column c + } // for each column c + + // Last iteration: c = 0. 
+ { + const lno_t c = 0; + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = beg; k < end; ++k) { + const scalar_t A_rc = val(k); + const lno_t r = ind(k); + for (lno_t j = 0; j < numVecs; ++j) { + X(r, j) -= A_rc * X(c, j); + } + } // for each entry A_rc in the current column c } } -} -template -void lowerTriSolveCsc(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - - const local_ordinal_type numRows = A.numRows(); - const local_ordinal_type numCols = A.numCols(); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type j = 0; j < numVecs; ++j) { - for (local_ordinal_type i = 0; i < numRows; ++i) { - X(i, j) = Y(i, j); + static void upperTriSolveCsc(RangeMultiVectorType X, const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numCols = A.numCols(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + manual_copy(X, Y); + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + // If lno_t is unsigned and numCols is 0, the loop + // below will have entirely the wrong number of iterations. 
+ if (numCols == 0) { + return; } - } - for (local_ordinal_type c = 0; c < numCols; ++c) { - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = beg; k < end; ++k) { - const local_ordinal_type r = ind(k); - const matrix_scalar_type A_rc = val(k); + // Don't use c >= 0 as the test, because that fails if + // lno_t is unsigned. We do c == 0 (last + // iteration) below. + for (lno_t c = numCols - 1; c != 0; --c) { + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = end - 1; k >= beg; --k) { + const lno_t r = ind(k); + const auto A_rc = val(k); + /*(vqd 20 Jul 2020) This assumes that the diagonal entry + has equal local row and column indices. That may not + necessarily hold, depending on the row and column Maps. See + note above.*/ + for (lno_t j = 0; j < numVecs; ++j) { + if (r == c) { + X(c, j) = X(c, j) / A_rc; + } else { + X(r, j) -= A_rc * X(c, j); + } + } + } // for each entry A_rc in the current column c + } // for each column c + + // Last iteration: c = 0. + { + const offset_type beg = ptr(0); + const auto A_rc = val(beg); /*(vqd 20 Jul 2020) This assumes that the diagonal entry has equal local row and column indices. That may not necessarily hold, depending on the row and column Maps. 
See note above.*/ - for (local_ordinal_type j = 0; j < numVecs; ++j) { - if (r == c) { - X(c, j) = X(c, j) / A_rc; - } else { - X(r, j) -= A_rc * X(c, j); - } + for (lno_t j = 0; j < numVecs; ++j) { + X(0, j) = X(0, j) / A_rc; } - } // for each entry A_rc in the current column c - } // for each column c -} - -template -void lowerTriSolveCscUnitDiagConj(RangeMultiVectorType X, - const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - typedef Kokkos::ArithTraits STS; - - const local_ordinal_type numRows = A.numRows(); - const local_ordinal_type numCols = A.numCols(); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type j = 0; j < numVecs; ++j) { - for (local_ordinal_type i = 0; i < numRows; ++i) { - X(i, j) = Y(i, j); } } - for (local_ordinal_type c = 0; c < numCols; ++c) { - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = beg; k < end; ++k) { - const local_ordinal_type r = ind(k); - const matrix_scalar_type A_rc = STS::conj(val(k)); - for (local_ordinal_type j = 0; j < numVecs; ++j) { - X(r, j) -= A_rc * X(c, j); - } - } // for each entry A_rc in the current column c - } // for each column c -} + static void lowerTriSolveCscUnitDiag(RangeMultiVectorType X, + const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numCols = A.numCols(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = 
A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + for (lno_t c = 0; c < numCols; ++c) { + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = beg; k < end; ++k) { + const lno_t r = ind(k); + const scalar_t A_rc = val(k); + for (lno_t j = 0; j < numVecs; ++j) { + X(r, j) -= A_rc * X(c, j); + } + } // for each entry A_rc in the current column c + } // for each column c + } -template -void lowerTriSolveCscConj(RangeMultiVectorType X, const CrsMatrixType& A, - DomainMultiVectorType Y) { - typedef - typename CrsMatrixType::row_map_type::non_const_value_type offset_type; - typedef typename CrsMatrixType::index_type::non_const_value_type - local_ordinal_type; - typedef typename CrsMatrixType::values_type::non_const_value_type - matrix_scalar_type; - typedef Kokkos::ArithTraits STS; - - const local_ordinal_type numRows = A.numRows(); - const local_ordinal_type numCols = A.numCols(); - const local_ordinal_type numVecs = X.extent(1); - typename CrsMatrixType::row_map_type ptr = A.graph.row_map; - typename CrsMatrixType::index_type ind = A.graph.entries; - typename CrsMatrixType::values_type val = A.values; - - for (local_ordinal_type j = 0; j < numVecs; ++j) { - for (local_ordinal_type i = 0; i < numRows; ++i) { - X(i, j) = Y(i, j); + static void upperTriSolveCscUnitDiagConj(RangeMultiVectorType X, + const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numCols = A.numCols(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + 
KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + // If lno_t is unsigned and numCols is 0, the loop + // below will have entirely the wrong number of iterations. + if (numCols == 0) { + return; + } + + // Don't use c >= 0 as the test, because that fails if + // lno_t is unsigned. We do c == 0 (last + // iteration) below. + for (lno_t c = numCols - 1; c != 0; --c) { + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = beg; k < end; ++k) { + const lno_t r = ind(k); + const scalar_t A_rc = STS::conj(val(k)); + for (lno_t j = 0; j < numVecs; ++j) { + X(r, j) -= A_rc * X(c, j); + } + } // for each entry A_rc in the current column c + } // for each column c + + // Last iteration: c = 0. + { + const lno_t c = 0; + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = beg; k < end; ++k) { + const lno_t r = ind(k); + const scalar_t A_rc = STS::conj(val(k)); + for (lno_t j = 0; j < numVecs; ++j) { + X(r, j) -= A_rc * X(c, j); + } + } // for each entry A_rc in the current column c } } - for (local_ordinal_type c = 0; c < numCols; ++c) { - const offset_type beg = ptr(c); - const offset_type end = ptr(c + 1); - for (offset_type k = beg; k < end; ++k) { - const local_ordinal_type r = ind(k); - const matrix_scalar_type A_rc = STS::conj(val(k)); + static void upperTriSolveCscConj(RangeMultiVectorType X, + const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numCols = A.numCols(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + // If 
lno_t is unsigned and numCols is 0, the loop + // below will have entirely the wrong number of iterations. + if (numCols == 0) { + return; + } + + // Don't use c >= 0 as the test, because that fails if + // lno_t is unsigned. We do c == 0 (last + // iteration) below. + for (lno_t c = numCols - 1; c != 0; --c) { + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = end - 1; k >= beg; --k) { + const lno_t r = ind(k); + const scalar_t A_rc = STS::conj(val(k)); + /*(vqd 20 Jul 2020) This assumes that the diagonal entry + has equal local row and column indices. That may not + necessarily hold, depending on the row and column Maps. See + note above.*/ + for (lno_t j = 0; j < numVecs; ++j) { + if (r == c) { + X(c, j) = X(c, j) / A_rc; + } else { + X(r, j) -= A_rc * X(c, j); + } + } + } // for each entry A_rc in the current column c + } // for each column c + + // Last iteration: c = 0. + { + const offset_type beg = ptr(0); + const scalar_t A_rc = STS::conj(val(beg)); /*(vqd 20 Jul 2020) This assumes that the diagonal entry has equal local row and column indices. That may not necessarily hold, depending on the row and column Maps. 
See note above.*/ - for (local_ordinal_type j = 0; j < numVecs; ++j) { - if (r == c) { - X(c, j) = X(c, j) / A_rc; - } else { + for (lno_t j = 0; j < numVecs; ++j) { + X(0, j) = X(0, j) / A_rc; + } + } + } + + static void lowerTriSolveCsc(RangeMultiVectorType X, const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numCols = A.numCols(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + for (lno_t c = 0; c < numCols; ++c) { + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = beg; k < end; ++k) { + const lno_t r = ind(k); + const scalar_t A_rc = val(k); + /*(vqd 20 Jul 2020) This assumes that the diagonal entry + has equal local row and column indices. That may not + necessarily hold, depending on the row and column Maps. 
See + note above.*/ + for (lno_t j = 0; j < numVecs; ++j) { + if (r == c) { + X(c, j) = X(c, j) / A_rc; + } else { + X(r, j) -= A_rc * X(c, j); + } + } + } // for each entry A_rc in the current column c + } // for each column c + } + + static void lowerTriSolveCscUnitDiagConj(RangeMultiVectorType X, + const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numCols = A.numCols(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + for (lno_t c = 0; c < numCols; ++c) { + const offset_type beg = ptr(c); + const offset_type end = ptr(c + 1); + for (offset_type k = beg; k < end; ++k) { + const lno_t r = ind(k); + const scalar_t A_rc = STS::conj(val(k)); + for (lno_t j = 0; j < numVecs; ++j) { X(r, j) -= A_rc * X(c, j); } - } - } // for each entry A_rc in the current column c - } // for each column c -} + } // for each entry A_rc in the current column c + } // for each column c + } + + static void lowerTriSolveCscConj(RangeMultiVectorType X, + const CrsMatrixType& A, + DomainMultiVectorType Y) { + const lno_t numRows = A.numRows(); + const lno_t numCols = A.numCols(); + const lno_t numPointRows = A.numPointRows(); + const lno_t block_size = numPointRows / numRows; + const lno_t numVecs = X.extent(1); + typename CrsMatrixType::row_map_type ptr = A.graph.row_map; + typename CrsMatrixType::index_type ind = A.graph.entries; + typename CrsMatrixType::values_type val = A.values; + + KK_REQUIRE_MSG(block_size == 1, "BSRs not support for this function yet"); + + manual_copy(X, Y); + + for (lno_t c = 0; c < numCols; ++c) { + const offset_type beg = ptr(c); + const 
offset_type end = ptr(c + 1); + for (offset_type k = beg; k < end; ++k) { + const lno_t r = ind(k); + const scalar_t A_rc = STS::conj(val(k)); + /*(vqd 20 Jul 2020) This assumes that the diagonal entry + has equal local row and column indices. That may not + necessarily hold, depending on the row and column Maps. See + note above.*/ + for (lno_t j = 0; j < numVecs; ++j) { + if (r == c) { + X(c, j) = X(c, j) / A_rc; + } else { + X(r, j) -= A_rc * X(c, j); + } + } + } // for each entry A_rc in the current column c + } // for each column c + } +}; } // namespace Sequential } // namespace Impl } // namespace KokkosSparse -#endif // KOKKOSSPARSE_IMPL_TRSM_HPP +#endif // KOKKOSSPARSE_TRSV_IMPL_HPP_ diff --git a/sparse/impl/KokkosSparse_trsv_spec.hpp b/sparse/impl/KokkosSparse_trsv_spec.hpp index 2e838337d2..a74f4ffe64 100644 --- a/sparse/impl/KokkosSparse_trsv_spec.hpp +++ b/sparse/impl/KokkosSparse_trsv_spec.hpp @@ -20,6 +20,7 @@ #include #include #include "KokkosSparse_CrsMatrix.hpp" +#include "KokkosSparse_BsrMatrix.hpp" // Include the actual functors #if !defined(KOKKOSKERNELS_ETI_ONLY) || KOKKOSKERNELS_IMPL_COMPILE_LIBRARY @@ -55,6 +56,22 @@ struct trsv_eti_spec_avail { Kokkos::Device, \ Kokkos::MemoryTraits > > { \ enum : bool { value = true }; \ + }; \ + \ + template <> \ + struct trsv_eti_spec_avail< \ + KokkosSparse::Experimental::BsrMatrix< \ + const SCALAR_TYPE, const ORDINAL_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits, const OFFSET_TYPE>, \ + Kokkos::View< \ + const SCALAR_TYPE **, LAYOUT_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ }; // Include the actual specialization declarations @@ -93,50 +110,52 @@ struct TRSV; if (trans[0] == 'N' || trans[0] == 'n') { // no transpose if (uplo[0] == 'L' || uplo[0] == 'l') { // lower triangular if (diag[0] == 'U' || diag[0] == 'u') { // unit diagonal - Sequential::lowerTriSolveCsrUnitDiag(X, A, B); + 
Wrap::lowerTriSolveCsrUnitDiag(X, A, B);
         } else {  // non unit diagonal
-          Sequential::lowerTriSolveCsr(X, A, B);
+          Wrap::lowerTriSolveCsr(X, A, B);
         }
       } else {  // upper triangular
         if (diag[0] == 'U' || diag[0] == 'u') {  // unit diagonal
-          Sequential::upperTriSolveCsrUnitDiag(X, A, B);
+          Wrap::upperTriSolveCsrUnitDiag(X, A, B);
         } else {  // non unit diagonal
-          Sequential::upperTriSolveCsr(X, A, B);
+          Wrap::upperTriSolveCsr(X, A, B);
         }
       }
     } else if (trans[0] == 'T' || trans[0] == 't') {  // transpose
       if (uplo[0] == 'L' || uplo[0] == 'l') {  // lower triangular
         // Transposed lower tri CSR => upper tri CSC.
         if (diag[0] == 'U' || diag[0] == 'u') {  // unit diagonal
-          Sequential::upperTriSolveCscUnitDiag(X, A, B);
+          Wrap::upperTriSolveCscUnitDiag(X, A, B);
         } else {  // non unit diagonal
-          Sequential::upperTriSolveCsc(X, A, B);
+          Wrap::upperTriSolveCsc(X, A, B);
         }
       } else {  // upper triangular
         // Transposed upper tri CSR => lower tri CSC.
         if (diag[0] == 'U' || diag[0] == 'u') {  // unit diagonal
-          Sequential::lowerTriSolveCscUnitDiag(X, A, B);
+          Wrap::lowerTriSolveCscUnitDiag(X, A, B);
         } else {  // non unit diagonal
-          Sequential::lowerTriSolveCsc(X, A, B);
+          Wrap::lowerTriSolveCsc(X, A, B);
         }
       }
     } else if (trans[0] == 'C' || trans[0] == 'c') {  // conj transpose
       if (uplo[0] == 'L' || uplo[0] == 'l') {  // lower triangular
         // Transposed lower tri CSR => upper tri CSC.
         if (diag[0] == 'U' || diag[0] == 'u') {  // unit diagonal
-          Sequential::upperTriSolveCscUnitDiagConj(X, A, B);
+          Wrap::upperTriSolveCscUnitDiagConj(X, A, B);
         } else {  // non unit diagonal
-          Sequential::upperTriSolveCscConj(X, A, B);
+          Wrap::upperTriSolveCscConj(X, A, B);
         }
       } else {  // upper triangular
         // Transposed upper tri CSR => lower tri CSC.
if (diag[0] == 'U' || diag[0] == 'u') { // unit diagonal - Sequential::lowerTriSolveCscUnitDiagConj(X, A, B); + Wrap::lowerTriSolveCscUnitDiagConj(X, A, B); } else { // non unit diagonal - Sequential::lowerTriSolveCscConj(X, A, B); + Wrap::lowerTriSolveCscConj(X, A, B); } } } @@ -169,6 +188,20 @@ struct TRSV, \ Kokkos::MemoryTraits >, \ + false, true>; \ + \ + extern template struct TRSV< \ + KokkosSparse::Experimental::BsrMatrix< \ + const SCALAR_TYPE, const ORDINAL_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits, const OFFSET_TYPE>, \ + Kokkos::View< \ + const SCALAR_TYPE **, LAYOUT_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ false, true>; #define KOKKOSSPARSE_TRSV_ETI_SPEC_INST(SCALAR_TYPE, ORDINAL_TYPE, \ @@ -186,6 +219,20 @@ struct TRSV, \ Kokkos::MemoryTraits >, \ + false, true>; \ + \ + template struct TRSV< \ + KokkosSparse::Experimental::BsrMatrix< \ + const SCALAR_TYPE, const ORDINAL_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits, const OFFSET_TYPE>, \ + Kokkos::View< \ + const SCALAR_TYPE **, LAYOUT_TYPE, \ + Kokkos::Device, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ false, true>; #include diff --git a/sparse/src/KokkosKernels_Handle.hpp b/sparse/src/KokkosKernels_Handle.hpp index d500f19d48..680045823e 100644 --- a/sparse/src/KokkosKernels_Handle.hpp +++ b/sparse/src/KokkosKernels_Handle.hpp @@ -605,18 +605,18 @@ class KokkosKernelsHandle { // clang-format off /** * @brief Create a gauss seidel handle object - * + * * @param handle_exec_space The execution space instance to execute kernels on. * @param num_streams The number of streams to allocate memory for. * @param gs_algorithm Specifies which algorithm to use: - * + * * KokkosSpace::GS_DEFAULT PointGaussSeidel * KokkosSpace::GS_PERMUTED ?? * KokkosSpace::GS_TEAM ?? * KokkosSpace::GS_CLUSTER ?? * KokkosSpace::GS_TWOSTAGE ?? 
    * @param coloring_algorithm Specifies which coloring algorithm to color the graph with:
-   * 
+   *
    *   KokkosGraph::COLORING_DEFAULT ??
    *   KokkosGraph::COLORING_SERIAL Serial Greedy Coloring
    *   KokkosGraph::COLORING_VB Vertex Based Coloring
@@ -649,9 +649,9 @@ class KokkosKernelsHandle {
   // clang-format off
   /**
    * @brief Create a gauss seidel handle object
-   * 
+   *
    * @param gs_algorithm Specifies which algorithm to use:
-   * 
+   *
    *   KokkosSpace::GS_DEFAULT PointGaussSeidel or BlockGaussSeidel, depending on matrix type.
    *   KokkosSpace::GS_PERMUTED Reorders rows/cols into colors to improve locality. Uses RangePolicy over rows.
    *   KokkosSpace::GS_TEAM Uses TeamPolicy over batches of rows with ThreadVector within rows.
@@ -660,7 +660,7 @@ class KokkosKernelsHandle {
    *   KokkosSpace::GS_TWOSTAGE Uses spmv to parallelize inner sweeps of x.
    *   For more information, see: https://arxiv.org/pdf/2104.01196.pdf.
    * @param coloring_algorithm Specifies which coloring algorithm to color the graph with:
-   * 
+   *
    *   KokkosGraph::COLORING_DEFAULT Depends on execution space:
    *   COLORING_SERIAL on Kokkos::Serial;
    *   COLORING_EB on GPUs;
@@ -744,16 +744,16 @@ class KokkosKernelsHandle {
   // clang-format off
   /**
    * @brief Create a gs handle object
-   * 
+   *
    * @param clusterAlgo Specifies which clustering algorithm to use:
-   * 
-   *   KokkosSparse::ClusteringAlgorithm::CLUSTER_DEFAULT ??
-   *   KokkosSparse::ClusteringAlgorithm::CLUSTER_MIS2 ??
-   *   KokkosSparse::ClusteringAlgorithm::CLUSTER_BALLOON ??
-   *   KokkosSparse::ClusteringAlgorithm::NUM_CLUSTERING_ALGORITHMS ??
+   *
+   *   KokkosSparse::CLUSTER_DEFAULT ??
+   *   KokkosSparse::CLUSTER_MIS2 ??
+   *   KokkosSparse::CLUSTER_BALLOON ??
+   *   KokkosSparse::NUM_CLUSTERING_ALGORITHMS ??
    * @param hint_verts_per_cluster Hint how many verticies to use per cluster
    * @param coloring_algorithm Specifies which coloring algorithm to color the graph with:
-   * 
+   *
    *   KokkosGraph::COLORING_DEFAULT ??
    *   KokkosGraph::COLORING_SERIAL Serial Greedy Coloring
    *   KokkosGraph::COLORING_VB Vertex Based Coloring
@@ -821,10 +821,11 @@ class KokkosKernelsHandle {
   // ---------------------------------------- //
   SPADDHandleType *get_spadd_handle() { return this->spaddHandle; }
-  void create_spadd_handle(bool input_sorted) {
+  void create_spadd_handle(bool input_sorted = false,
+                           bool input_merged = false) {
     this->destroy_spadd_handle();
     this->is_owner_of_the_spadd_handle = true;
-    this->spaddHandle = new SPADDHandleType(input_sorted);
+    this->spaddHandle = new SPADDHandleType(input_sorted, input_merged);
   }
   void destroy_spadd_handle() {
     if (is_owner_of_the_spadd_handle && this->spaddHandle != NULL) {
@@ -947,11 +948,13 @@ class KokkosKernelsHandle {
   SPILUKHandleType *get_spiluk_handle() { return this->spilukHandle; }
   void create_spiluk_handle(KokkosSparse::Experimental::SPILUKAlgorithm algm,
-                            size_type nrows, size_type nnzL, size_type nnzU) {
+                            size_type nrows, size_type nnzL, size_type nnzU,
+                            size_type block_size = 0) {
     this->destroy_spiluk_handle();
     this->is_owner_of_the_spiluk_handle = true;
-    this->spilukHandle = new SPILUKHandleType(algm, nrows, nnzL, nnzU);
-    this->spilukHandle->reset_handle(nrows, nnzL, nnzU);
+    this->spilukHandle =
+        new SPILUKHandleType(algm, nrows, nnzL, nnzU, block_size);
+    this->spilukHandle->reset_handle(nrows, nnzL, nnzU, block_size);
     this->spilukHandle->set_team_size(this->team_work_size);
     this->spilukHandle->set_vector_size(this->vector_size);
   }
diff --git a/sparse/src/KokkosSparse_BsrMatrix.hpp b/sparse/src/KokkosSparse_BsrMatrix.hpp
index e0d6e61a3b..db9ef71753 100644
--- a/sparse/src/KokkosSparse_BsrMatrix.hpp
+++ b/sparse/src/KokkosSparse_BsrMatrix.hpp
@@ -1108,6 +1108,10 @@ template
 struct is_bsr_matrix> : public std::true_type {};
 template
 struct is_bsr_matrix> : public std::true_type {};
+
+/// \brief Equivalent to is_bsr_matrix::value.
+template +inline constexpr bool is_bsr_matrix_v = is_bsr_matrix::value; //---------------------------------------------------------------------------- } // namespace Experimental diff --git a/sparse/src/KokkosSparse_CrsMatrix.hpp b/sparse/src/KokkosSparse_CrsMatrix.hpp index 7070172a1f..ce9ec99e4e 100644 --- a/sparse/src/KokkosSparse_CrsMatrix.hpp +++ b/sparse/src/KokkosSparse_CrsMatrix.hpp @@ -867,5 +867,9 @@ struct is_crs_matrix> : public std::true_type {}; template struct is_crs_matrix> : public std::true_type {}; +/// \brief Equivalent to is_crs_matrix::value. +template +inline constexpr bool is_crs_matrix_v = is_crs_matrix::value; + } // namespace KokkosSparse #endif diff --git a/sparse/src/KokkosSparse_LUPrec.hpp b/sparse/src/KokkosSparse_LUPrec.hpp index a257b8f09c..d687c8dd4f 100644 --- a/sparse/src/KokkosSparse_LUPrec.hpp +++ b/sparse/src/KokkosSparse_LUPrec.hpp @@ -24,6 +24,7 @@ #include #include #include +#include namespace KokkosSparse { namespace Experimental { @@ -45,8 +46,9 @@ class LUPrec : public KokkosSparse::Experimental::Preconditioner { using ScalarType = typename std::remove_const::type; using EXSP = typename CRS::execution_space; using MEMSP = typename CRS::memory_space; + using DEVICE = typename Kokkos::Device; using karith = typename Kokkos::ArithTraits; - using View1d = typename Kokkos::View; + using View1d = typename Kokkos::View; private: // trsm takes host views @@ -61,11 +63,11 @@ class LUPrec : public KokkosSparse::Experimental::Preconditioner { LUPrec(const CRSArg &L, const CRSArg &U) : _L(L), _U(U), - _tmp("LUPrec::_tmp", L.numRows()), - _tmp2("LUPrec::_tmp", L.numRows()), + _tmp("LUPrec::_tmp", L.numPointRows()), + _tmp2("LUPrec::_tmp", L.numPointRows()), _khL(), _khU() { - KK_REQUIRE_MSG(L.numRows() == U.numRows(), + KK_REQUIRE_MSG(L.numPointRows() == U.numPointRows(), "LUPrec: L.numRows() != U.numRows()"); _khL.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, L.numRows(), @@ -80,22 +82,13 @@ class LUPrec : public 
KokkosSparse::Experimental::Preconditioner { _khU.destroy_sptrsv_handle(); } - ///// \brief Apply the preconditioner to X, putting the result in Y. - ///// - ///// \tparam XViewType Input vector, as a 1-D Kokkos::View - ///// \tparam YViewType Output vector, as a nonconst 1-D Kokkos::View - ///// - ///// \param transM [in] Not used. - ///// \param alpha [in] Not used - ///// \param beta [in] Not used. - ///// - ///// It takes L and U and the stores U^inv L^inv X in Y - // - virtual void apply( - const Kokkos::View> &X, - const Kokkos::View> &Y, - const char transM[] = "N", ScalarType alpha = karith::one(), - ScalarType beta = karith::zero()) const { + template < + typename Matrix, + typename std::enable_if::value>::type * = nullptr> + void apply_impl(const Kokkos::View &X, + const Kokkos::View &Y, + const char transM[] = "N", ScalarType alpha = karith::one(), + ScalarType beta = karith::zero()) const { // tmp = trsv(L, x); //Apply L^inv to x // y = trsv(U, tmp); //Apply U^inv to tmp @@ -111,6 +104,62 @@ class LUPrec : public KokkosSparse::Experimental::Preconditioner { KokkosBlas::axpby(alpha, _tmp2, beta, Y); } + + template < + typename Matrix, + typename std::enable_if::value>::type * = nullptr> + void apply_impl(const Kokkos::View &X, + const Kokkos::View &Y, + const char transM[] = "N", ScalarType alpha = karith::one(), + ScalarType beta = karith::zero()) const { + // tmp = trsv(L, x); //Apply L^inv to x + // y = trsv(U, tmp); //Apply U^inv to tmp + + KK_REQUIRE_MSG(transM[0] == NoTranspose[0], + "LUPrec::apply only supports 'N' for transM"); + +#if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) + using Layout = Kokkos::LayoutLeft; +#else + using Layout = Kokkos::LayoutRight; +#endif + + // trsv is implemented for MV so we need to convert our views + using UView2d = typename Kokkos::View< + ScalarType **, Layout, DEVICE, + Kokkos::MemoryTraits >; + using UView2dc = typename Kokkos::View< + const ScalarType **, Layout, DEVICE, + Kokkos::MemoryTraits >; + UView2dc 
X2d(X.data(), X.extent(0), 1); + UView2d Y2d(Y.data(), Y.extent(0), 1), + tmp2d(_tmp.data(), _tmp.extent(0), 1), + tmp22d(_tmp2.data(), _tmp2.extent(0), 1); + + KokkosSparse::trsv("L", "N", "N", _L, X2d, tmp2d); + KokkosSparse::trsv("U", "N", "N", _U, tmp2d, tmp22d); + + KokkosBlas::axpby(alpha, _tmp2, beta, Y); + } + + ///// \brief Apply the preconditioner to X, putting the result in Y. + ///// + ///// \tparam XViewType Input vector, as a 1-D Kokkos::View + ///// \tparam YViewType Output vector, as a nonconst 1-D Kokkos::View + ///// + ///// \param transM [in] Not used. + ///// \param alpha [in] Not used + ///// \param beta [in] Not used. + ///// + ///// It takes L and U and the stores U^inv L^inv X in Y + // + virtual void apply(const Kokkos::View &X, + const Kokkos::View &Y, + const char transM[] = "N", + ScalarType alpha = karith::one(), + ScalarType beta = karith::zero()) const { + apply_impl(X, Y, transM, alpha, beta); + } //@} //! Set this preconditioner's parameters. diff --git a/sparse/src/KokkosSparse_Utils.hpp b/sparse/src/KokkosSparse_Utils.hpp index f3fbec1836..2b89c1a2f7 100644 --- a/sparse/src/KokkosSparse_Utils.hpp +++ b/sparse/src/KokkosSparse_Utils.hpp @@ -25,6 +25,7 @@ #include "KokkosSparse_CrsMatrix.hpp" #include "KokkosSparse_BsrMatrix.hpp" #include "Kokkos_Bitset.hpp" +#include "KokkosGraph_RCM.hpp" #ifdef KOKKOSKERNELS_HAVE_PARALLEL_GNUSORT #include @@ -2415,15 +2416,23 @@ void kk_extract_subblock_crsmatrix_sequential( * @tparam crsMat_t The type of the CRS matrix. * @param A [in] The square CrsMatrix. 
It is expected that column indices are * in ascending order + * @param UseRCMReordering [in] Boolean indicating whether applying (true) RCM + * reordering to diagonal blocks or not (false) (default: false) * @param DiagBlk_v [out] The vector of the extracted the CRS diagonal blocks * (1 <= the number of diagonal blocks <= A_nrows) + * @return a vector of lists of vertices in RCM order (a list per a diagonal + * block) if UseRCMReordering is true, or an empty vector if UseRCMReordering is + * false * * Usage Example: - * kk_extract_diagonal_blocks_crsmatrix_sequential(A_in, diagBlk_in_b); + * perm = kk_extract_diagonal_blocks_crsmatrix_sequential(A_in, diagBlk_out, + * UseRCMReordering); */ template -void kk_extract_diagonal_blocks_crsmatrix_sequential( - const crsMat_t &A, std::vector &DiagBlk_v) { +std::vector +kk_extract_diagonal_blocks_crsmatrix_sequential( + const crsMat_t &A, std::vector &DiagBlk_v, + bool UseRCMReordering = false) { using row_map_type = typename crsMat_t::row_map_type; using entries_type = typename crsMat_t::index_type; using values_type = typename crsMat_t::values_type; @@ -2437,6 +2446,7 @@ void kk_extract_diagonal_blocks_crsmatrix_sequential( using ordinal_type = typename crsMat_t::non_const_ordinal_type; using size_type = typename crsMat_t::non_const_size_type; + using value_type = typename crsMat_t::non_const_value_type; using offset_view1d_type = Kokkos::View; @@ -2463,8 +2473,12 @@ void kk_extract_diagonal_blocks_crsmatrix_sequential( throw std::runtime_error(os.str()); } + std::vector perm_v; + std::vector perm_h_v; + if (n_blocks == 1) { // One block case: simply shallow copy A to DiagBlk_v[0] + // Note: always not applying RCM reordering, for now DiagBlk_v[0] = crsMat_t(A); } else { // n_blocks > 1 @@ -2487,12 +2501,10 @@ void kk_extract_diagonal_blocks_crsmatrix_sequential( ? 
(A_nrows / n_blocks) : (A_nrows / n_blocks + 1); - std::vector row_map_v(n_blocks); - std::vector entries_v(n_blocks); - std::vector values_v(n_blocks); - std::vector row_map_h_v(n_blocks); - std::vector entries_h_v(n_blocks); - std::vector values_h_v(n_blocks); + if (UseRCMReordering) { + perm_v.resize(n_blocks); + perm_h_v.resize(n_blocks); + } ordinal_type blk_row_start = 0; // first row index of i-th diagonal block ordinal_type blk_col_start = 0; // first col index of i-th diagonal block @@ -2509,37 +2521,110 @@ void kk_extract_diagonal_blocks_crsmatrix_sequential( // First round: count i-th non-zeros or size of entries_v[i] and find // the first and last column indices at each row size_type blk_nnz = 0; - offset_view1d_type first("first", blk_nrows); // first position per row - offset_view1d_type last("last", blk_nrows); // last position per row + offset_view1d_type first( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "first"), + blk_nrows); // first position per row + offset_view1d_type last( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "last"), + blk_nrows); // last position per row kk_find_nnz_first_last_indices_subblock_crsmatrix_sequential( A_row_map_h, A_entries_h, blk_row_start, blk_col_start, blk_nrows, blk_ncols, blk_nnz, first, last); // Second round: extract - row_map_v[i] = out_row_map_type("row_map_v", blk_nrows + 1); - entries_v[i] = out_entries_type("entries_v", blk_nnz); - values_v[i] = out_values_type("values_v", blk_nnz); - row_map_h_v[i] = - out_row_map_hostmirror_type("row_map_h_v", blk_nrows + 1); - entries_h_v[i] = out_entries_hostmirror_type("entries_h_v", blk_nnz); - values_h_v[i] = out_values_hostmirror_type("values_h_v", blk_nnz); + out_row_map_type row_map( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "row_map"), + blk_nrows + 1); + out_entries_type entries( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "entries"), + blk_nnz); + out_values_type values( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "values"), 
blk_nnz); + out_row_map_hostmirror_type row_map_h( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "row_map_h"), + blk_nrows + 1); + out_entries_hostmirror_type entries_h( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "entries_h"), + blk_nnz); + out_values_hostmirror_type values_h( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "values_h"), + blk_nnz); kk_extract_subblock_crsmatrix_sequential( A_entries_h, A_values_h, blk_col_start, blk_nrows, blk_nnz, first, - last, row_map_h_v[i], entries_h_v[i], values_h_v[i]); + last, row_map_h, entries_h, values_h); + + if (!UseRCMReordering) { + Kokkos::deep_copy(row_map, row_map_h); + Kokkos::deep_copy(entries, entries_h); + Kokkos::deep_copy(values, values_h); + } else { + perm_h_v[i] = KokkosGraph::Experimental::graph_rcm< + Kokkos::DefaultHostExecutionSpace>(row_map_h, entries_h); + perm_v[i] = out_entries_type( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "perm_v"), + perm_h_v[i].extent(0)); + + out_row_map_hostmirror_type row_map_perm_h( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "row_map_perm_h"), + blk_nrows + 1); + out_entries_hostmirror_type entries_perm_h( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "entries_perm_h"), + blk_nnz); + out_values_hostmirror_type values_perm_h( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "values_perm_h"), + blk_nnz); + + out_entries_hostmirror_type reverseperm_h( + Kokkos::view_alloc(Kokkos::WithoutInitializing, "reverseperm_h"), + blk_nrows); + for (ordinal_type ii = 0; ii < blk_nrows; ii++) + reverseperm_h(perm_h_v[i](ii)) = ii; + + std::map colIdx_Value_rcm; + + // Loop through each row of the reordered matrix + size_type cnt = 0; + for (ordinal_type ii = 0; ii < blk_nrows; ii++) { + colIdx_Value_rcm.clear(); + // ii: reordered index + ordinal_type origRow = reverseperm_h( + ii); // get the original row idx of the reordered row idx, ii + for (size_type j = row_map_h(origRow); j < row_map_h(origRow + 1); + j++) { + ordinal_type origEi = entries_h(j); + 
value_type origV = values_h(j); + ordinal_type Ei = + perm_h_v[i](origEi); // get the reordered col idx of the + // original col idx, origEi + colIdx_Value_rcm[Ei] = origV; + } + row_map_perm_h(ii) = cnt; + for (typename std::map::iterator it = + colIdx_Value_rcm.begin(); + it != colIdx_Value_rcm.end(); ++it) { + entries_perm_h(cnt) = it->first; + values_perm_h(cnt) = it->second; + cnt++; + } + } + row_map_perm_h(blk_nrows) = cnt; - Kokkos::deep_copy(row_map_v[i], row_map_h_v[i]); - Kokkos::deep_copy(entries_v[i], entries_h_v[i]); - Kokkos::deep_copy(values_v[i], values_h_v[i]); + Kokkos::deep_copy(row_map, row_map_perm_h); + Kokkos::deep_copy(entries, entries_perm_h); + Kokkos::deep_copy(values, values_perm_h); + Kokkos::deep_copy(perm_v[i], perm_h_v[i]); + } DiagBlk_v[i] = crsMat_t("CrsMatrix", blk_nrows, blk_ncols, blk_nnz, - values_v[i], row_map_v[i], entries_v[i]); + values, row_map, entries); blk_row_start += blk_nrows; } // for (ordinal_type i = 0; i < n_blocks; i++) } // A_nrows >= 1 } // n_blocks > 1 + return perm_v; } } // namespace Impl diff --git a/sparse/src/KokkosSparse_Utils_mkl.hpp b/sparse/src/KokkosSparse_Utils_mkl.hpp index 7a8dd0cb22..a14e19f3cf 100644 --- a/sparse/src/KokkosSparse_Utils_mkl.hpp +++ b/sparse/src/KokkosSparse_Utils_mkl.hpp @@ -62,9 +62,15 @@ inline void mkl_internal_safe_call(sparse_status_t mkl_status, const char *name, } } +} // namespace Impl +} // namespace KokkosSparse + #define KOKKOSKERNELS_MKL_SAFE_CALL(call) \ KokkosSparse::Impl::mkl_internal_safe_call(call, #call, __FILE__, __LINE__) +namespace KokkosSparse { +namespace Impl { + inline sparse_operation_t mode_kk_to_mkl(char mode_kk) { switch (toupper(mode_kk)) { case 'N': return SPARSE_OPERATION_NON_TRANSPOSE; @@ -88,11 +94,58 @@ struct mkl_is_supported_value_type> : std::true_type {}; template <> struct mkl_is_supported_value_type> : std::true_type {}; +// Helper to: +// - define the MKL type equivalent to a given Kokkos scalar type +// - provide an easy implicit 
conversion to that MKL type +template +struct KokkosToMKLScalar { + static_assert(mkl_is_supported_value_type::value, + "Scalar type not supported by MKL"); + using type = Scalar; + KokkosToMKLScalar(Scalar val_) : val(val_) {} + operator Scalar() const { return val; } + Scalar val; +}; + +template <> +struct KokkosToMKLScalar> { + using type = MKL_Complex8; + KokkosToMKLScalar(Kokkos::complex val_) : val(val_) {} + operator MKL_Complex8() const { return {val.real(), val.imag()}; } + Kokkos::complex val; +}; + +template <> +struct KokkosToMKLScalar> { + using type = MKL_Complex16; + KokkosToMKLScalar(Kokkos::complex val_) : val(val_) {} + operator MKL_Complex16() const { return {val.real(), val.imag()}; } + Kokkos::complex val; +}; + +template +struct KokkosToOneMKLScalar { + // Note: we happen to use the same set of types in classic MKL and OneMKL. + // If that changes, update this logic. + static_assert(mkl_is_supported_value_type::value, + "Scalar type not supported by OneMKL"); + using type = Scalar; +}; + +template +struct KokkosToOneMKLScalar> { + static_assert(mkl_is_supported_value_type>::value, + "Scalar type not supported by OneMKL"); + using type = std::complex; +}; + // MKLSparseMatrix provides thin wrapper around MKL matrix handle // (sparse_matrix_t) and encapsulates MKL call dispatches related to details // like value_type, allowing simple client code in kernels. 
template class MKLSparseMatrix { + static_assert(mkl_is_supported_value_type::value, + "Provided value_type type not supported by MKL"); sparse_matrix_t mtx; public: @@ -100,11 +153,7 @@ class MKLSparseMatrix { // Constructs MKL sparse matrix from KK sparse views (m rows x n cols) inline MKLSparseMatrix(const MKL_INT num_rows, const MKL_INT num_cols, - MKL_INT *xadj, MKL_INT *adj, value_type *values) { - throw std::runtime_error( - "Scalar type used in MKLSparseMatrix is NOT " - "supported by MKL"); - } + MKL_INT *xadj, MKL_INT *adj, value_type *values) {} // Allows using MKLSparseMatrix directly in MKL calls inline operator sparse_matrix_t() const { return mtx; } @@ -112,11 +161,7 @@ class MKLSparseMatrix { // Exports MKL sparse matrix contents into KK views inline void export_data(MKL_INT &num_rows, MKL_INT &num_cols, MKL_INT *&rows_start, MKL_INT *&columns, - value_type *&values) { - throw std::runtime_error( - "Scalar type used in MKLSparseMatrix is NOT " - "supported by MKL"); - } + value_type *&values) {} inline void destroy() { KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_destroy(mtx)); diff --git a/sparse/src/KokkosSparse_coo2crs.hpp b/sparse/src/KokkosSparse_coo2crs.hpp index 45e54ce474..a29d818cb1 100644 --- a/sparse/src/KokkosSparse_coo2crs.hpp +++ b/sparse/src/KokkosSparse_coo2crs.hpp @@ -16,11 +16,6 @@ #ifndef _KOKKOSSPARSE_COO2CRS_HPP #define _KOKKOSSPARSE_COO2CRS_HPP -// The unorderedmap changes necessary for this to work -// have not made it into Kokkos 4.0.00 pr 4.0.01 will -// need to see if it happens in 4.1.00 to have a final -// version check here. 
-#if KOKKOS_VERSION >= 40099 || defined(DOXY) #include "KokkosSparse_CooMatrix.hpp" #include "KokkosSparse_CrsMatrix.hpp" @@ -99,5 +94,4 @@ auto coo2crs(KokkosSparse::CooMatrix= 40099 || defined(DOXY) #endif // _KOKKOSSPARSE_COO2CRS_HPP diff --git a/sparse/src/KokkosSparse_gauss_seidel_handle.hpp b/sparse/src/KokkosSparse_gauss_seidel_handle.hpp index 649229918d..624382ec5b 100644 --- a/sparse/src/KokkosSparse_gauss_seidel_handle.hpp +++ b/sparse/src/KokkosSparse_gauss_seidel_handle.hpp @@ -29,13 +29,22 @@ namespace KokkosSparse { enum GSAlgorithm { GS_DEFAULT, GS_PERMUTED, GS_TEAM, GS_CLUSTER, GS_TWOSTAGE }; enum GSDirection { GS_FORWARD, GS_BACKWARD, GS_SYMMETRIC }; -enum ClusteringAlgorithm { +enum struct ClusteringAlgorithm { CLUSTER_DEFAULT, CLUSTER_MIS2, CLUSTER_BALLOON, NUM_CLUSTERING_ALGORITHMS }; +static constexpr ClusteringAlgorithm CLUSTER_DEFAULT = + ClusteringAlgorithm::CLUSTER_DEFAULT; +static constexpr ClusteringAlgorithm CLUSTER_MIS2 = + ClusteringAlgorithm::CLUSTER_MIS2; +static constexpr ClusteringAlgorithm CLUSTER_BALLOON = + ClusteringAlgorithm::CLUSTER_BALLOON; +static constexpr ClusteringAlgorithm NUM_CLUSTERING_ALGORITHMS = + ClusteringAlgorithm::NUM_CLUSTERING_ALGORITHMS; + inline const char *getClusterAlgoName(ClusteringAlgorithm ca) { switch (ca) { case CLUSTER_BALLOON: return "Balloon"; diff --git a/sparse/src/KokkosSparse_gmres.hpp b/sparse/src/KokkosSparse_gmres.hpp index 31b736c393..b0b708a330 100644 --- a/sparse/src/KokkosSparse_gmres.hpp +++ b/sparse/src/KokkosSparse_gmres.hpp @@ -89,8 +89,9 @@ void gmres(KernelHandle* handle, AMatrix& A, BType& B, XType& X, "gmres: A size type must match KernelHandle entry " "type (aka size_type, and const doesn't matter)"); - static_assert(KokkosSparse::is_crs_matrix::value, - "gmres: A is not a CRS matrix."); + static_assert(KokkosSparse::is_crs_matrix::value || + KokkosSparse::Experimental::is_bsr_matrix::value, + "gmres: A is not a CRS or BSR matrix."); static_assert(Kokkos::is_view::value, 
"gmres: B is not a Kokkos::View."); static_assert(Kokkos::is_view::value, @@ -120,8 +121,10 @@ void gmres(KernelHandle* handle, AMatrix& A, BType& B, XType& X, using c_persist_t = typename KernelHandle::HandlePersistentMemorySpace; if ((X.extent(0) != B.extent(0)) || - (static_cast(A.numCols()) != static_cast(X.extent(0))) || - (static_cast(A.numRows()) != static_cast(B.extent(0)))) { + (static_cast(A.numPointCols()) != + static_cast(X.extent(0))) || + (static_cast(A.numPointRows()) != + static_cast(B.extent(0)))) { std::ostringstream os; os << "KokkosSparse::gmres: Dimensions do not match: " << ", A: " << A.numRows() << " x " << A.numCols() @@ -135,11 +138,20 @@ void gmres(KernelHandle* handle, AMatrix& A, BType& B, XType& X, const_handle_type tmp_handle(*handle); - using AMatrix_Internal = KokkosSparse::CrsMatrix< + using AMatrix_Bsr_Internal = KokkosSparse::Experimental::BsrMatrix< typename AMatrix::const_value_type, typename AMatrix::const_ordinal_type, typename AMatrix::device_type, Kokkos::MemoryTraits, typename AMatrix::const_size_type>; + using AMatrix_Internal = std::conditional_t< + KokkosSparse::is_crs_matrix::value, + KokkosSparse::CrsMatrix, + typename AMatrix::const_size_type>, + AMatrix_Bsr_Internal>; + using B_Internal = Kokkos::View< typename BType::const_value_type*, typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, @@ -154,9 +166,9 @@ void gmres(KernelHandle* handle, AMatrix& A, BType& B, XType& X, using Precond_Internal = Preconditioner; - AMatrix_Internal A_i = A; - B_Internal b_i = B; - X_Internal x_i = X; + AMatrix_Internal A_i(A); + B_Internal b_i = B; + X_Internal x_i = X; Precond_Internal* precond_i = reinterpret_cast(precond); diff --git a/sparse/src/KokkosSparse_spadd.hpp b/sparse/src/KokkosSparse_spadd.hpp index 74efed66bc..127400c752 100644 --- a/sparse/src/KokkosSparse_spadd.hpp +++ b/sparse/src/KokkosSparse_spadd.hpp @@ -19,25 +19,27 @@ #include "KokkosKernels_Handle.hpp" #include "KokkosKernels_helpers.hpp" -#include 
"KokkosSparse_spadd_symbolic_spec.hpp" +#include "KokkosBlas1_scal.hpp" #include "KokkosSparse_spadd_numeric_spec.hpp" +#include "KokkosSparse_spadd_symbolic_spec.hpp" namespace KokkosSparse { namespace Experimental { // Symbolic: count entries in each row in C to produce rowmap // kernel handle has information about whether it is sorted add or not. -template void spadd_symbolic( - KernelHandle* handle, const alno_row_view_t_ a_rowmap, + const ExecSpace &exec, KernelHandle *handle, + typename KernelHandle::const_nnz_lno_t m, // same type as column indices + typename KernelHandle::const_nnz_lno_t n, const alno_row_view_t_ a_rowmap, const alno_nnz_view_t_ a_entries, const blno_row_view_t_ b_rowmap, const blno_nnz_view_t_ b_entries, clno_row_view_t_ c_rowmap) // c_rowmap must already be allocated (doesn't // need to be initialized) { - typedef typename KernelHandle::HandleExecSpace ExecSpace; typedef typename KernelHandle::HandleTempMemorySpace MemSpace; typedef typename KernelHandle::HandlePersistentMemorySpace PersistentMemSpace; typedef typename Kokkos::Device DeviceType; @@ -51,49 +53,75 @@ void spadd_symbolic( ConstKernelHandle; ConstKernelHandle tmp_handle(*handle); - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_a_rowmap; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_a_entries; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_b_rowmap; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_b_entries; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_c_rowmap; - KokkosSparse::Impl::SPADD_SYMBOLIC:: - spadd_symbolic(&tmp_handle, - Internal_a_rowmap(a_rowmap.data(), a_rowmap.extent(0)), - 
Internal_a_entries(a_entries.data(), a_entries.extent(0)), - Internal_b_rowmap(b_rowmap.data(), b_rowmap.extent(0)), - Internal_b_entries(b_entries.data(), b_entries.extent(0)), - Internal_c_rowmap(c_rowmap.data(), c_rowmap.extent(0))); + + auto addHandle = handle->get_spadd_handle(); + bool useFallback = !addHandle->is_input_strict_crs(); + if (useFallback) { + KokkosSparse::Impl::SPADD_SYMBOLIC< + ExecSpace, ConstKernelHandle, Internal_a_rowmap, Internal_a_entries, + Internal_b_rowmap, Internal_b_entries, Internal_c_rowmap, false>:: + spadd_symbolic( + exec, &tmp_handle, m, n, + Internal_a_rowmap(a_rowmap.data(), a_rowmap.extent(0)), + Internal_a_entries(a_entries.data(), a_entries.extent(0)), + Internal_b_rowmap(b_rowmap.data(), b_rowmap.extent(0)), + Internal_b_entries(b_entries.data(), b_entries.extent(0)), + Internal_c_rowmap(c_rowmap.data(), c_rowmap.extent(0))); + } else { + KokkosSparse::Impl::SPADD_SYMBOLIC< + ExecSpace, ConstKernelHandle, Internal_a_rowmap, Internal_a_entries, + Internal_b_rowmap, Internal_b_entries, Internal_c_rowmap>:: + spadd_symbolic( + exec, &tmp_handle, m, n, + Internal_a_rowmap(a_rowmap.data(), a_rowmap.extent(0)), + Internal_a_entries(a_entries.data(), a_entries.extent(0)), + Internal_b_rowmap(b_rowmap.data(), b_rowmap.extent(0)), + Internal_b_entries(b_entries.data(), b_entries.extent(0)), + Internal_c_rowmap(c_rowmap.data(), c_rowmap.extent(0))); + } } -template +void spadd_symbolic(KernelHandle *handle, Args... 
args) { + spadd_symbolic(typename KernelHandle::HandleExecSpace{}, handle, args...); +} + +template -void spadd_numeric(KernelHandle* handle, const alno_row_view_t_ a_rowmap, +void spadd_numeric(const ExecSpace &exec, KernelHandle *handle, + typename KernelHandle::const_nnz_lno_t m, + typename KernelHandle::const_nnz_lno_t n, + const alno_row_view_t_ a_rowmap, const alno_nnz_view_t_ a_entries, const ascalar_nnz_view_t_ a_values, const ascalar_t_ alpha, const blno_row_view_t_ b_rowmap, @@ -101,7 +129,6 @@ void spadd_numeric(KernelHandle* handle, const alno_row_view_t_ a_rowmap, const bscalar_nnz_view_t_ b_values, const bscalar_t_ beta, const clno_row_view_t_ c_rowmap, clno_nnz_view_t_ c_entries, cscalar_nnz_view_t_ c_values) { - typedef typename KernelHandle::HandleExecSpace ExecSpace; typedef typename KernelHandle::HandleTempMemorySpace MemSpace; typedef typename KernelHandle::HandlePersistentMemorySpace PersistentMemSpace; typedef typename Kokkos::Device DeviceType; @@ -113,116 +140,183 @@ void spadd_numeric(KernelHandle* handle, const alno_row_view_t_ a_rowmap, typedef typename KokkosKernels::Experimental::KokkosKernelsHandle< c_size_t, c_lno_t, c_scalar_t, ExecSpace, MemSpace, PersistentMemSpace> ConstKernelHandle; - ConstKernelHandle tmp_handle(*handle); + ConstKernelHandle tmp_handle(*handle); // handle->exec_space is also copied - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_a_rowmap; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_a_entries; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_a_values; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_b_rowmap; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> 
Internal_b_entries; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_b_values; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_c_rowmap; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_c_entries; - typedef Kokkos::View::array_layout, - DeviceType, Kokkos::MemoryTraits > + DeviceType, Kokkos::MemoryTraits> Internal_c_values; - KokkosSparse::Impl::SPADD_NUMERIC:: - spadd_numeric(&tmp_handle, alpha, - Internal_a_rowmap(a_rowmap.data(), a_rowmap.extent(0)), - Internal_a_entries(a_entries.data(), a_entries.extent(0)), - Internal_a_values(a_values.data(), a_values.extent(0)), - beta, - Internal_b_rowmap(b_rowmap.data(), b_rowmap.extent(0)), - Internal_b_entries(b_entries.data(), b_entries.extent(0)), - Internal_b_values(b_values.data(), b_values.extent(0)), - Internal_c_rowmap(c_rowmap.data(), c_rowmap.extent(0)), - Internal_c_entries(c_entries.data(), c_entries.extent(0)), - Internal_c_values(c_values.data(), c_values.extent(0))); + + auto addHandle = handle->get_spadd_handle(); + bool useFallback = !addHandle->is_input_strict_crs(); + if (useFallback) { + KokkosSparse::Impl::SPADD_NUMERIC< + ExecSpace, ConstKernelHandle, Internal_a_rowmap, Internal_a_entries, + Internal_a_values, Internal_b_rowmap, Internal_b_entries, + Internal_b_values, Internal_c_rowmap, Internal_c_entries, + Internal_c_values, false>:: + spadd_numeric(exec, &tmp_handle, m, n, alpha, + Internal_a_rowmap(a_rowmap.data(), a_rowmap.extent(0)), + Internal_a_entries(a_entries.data(), a_entries.extent(0)), + Internal_a_values(a_values.data(), a_values.extent(0)), + beta, + Internal_b_rowmap(b_rowmap.data(), b_rowmap.extent(0)), + Internal_b_entries(b_entries.data(), b_entries.extent(0)), + Internal_b_values(b_values.data(), b_values.extent(0)), + Internal_c_rowmap(c_rowmap.data(), 
c_rowmap.extent(0)), + Internal_c_entries(c_entries.data(), c_entries.extent(0)), + Internal_c_values(c_values.data(), c_values.extent(0))); + } else { + KokkosSparse::Impl::SPADD_NUMERIC< + ExecSpace, ConstKernelHandle, Internal_a_rowmap, Internal_a_entries, + Internal_a_values, Internal_b_rowmap, Internal_b_entries, + Internal_b_values, Internal_c_rowmap, Internal_c_entries, + Internal_c_values>:: + spadd_numeric(exec, &tmp_handle, m, n, alpha, + Internal_a_rowmap(a_rowmap.data(), a_rowmap.extent(0)), + Internal_a_entries(a_entries.data(), a_entries.extent(0)), + Internal_a_values(a_values.data(), a_values.extent(0)), + beta, + Internal_b_rowmap(b_rowmap.data(), b_rowmap.extent(0)), + Internal_b_entries(b_entries.data(), b_entries.extent(0)), + Internal_b_values(b_values.data(), b_values.extent(0)), + Internal_c_rowmap(c_rowmap.data(), c_rowmap.extent(0)), + Internal_c_entries(c_entries.data(), c_entries.extent(0)), + Internal_c_values(c_values.data(), c_values.extent(0))); + } +} + +// one without an execution space arg +template +void spadd_numeric(KernelHandle *handle, Args... args) { + spadd_numeric(typename KernelHandle::HandleExecSpace{}, handle, args...); } } // namespace Experimental // Symbolic: count entries in each row in C to produce rowmap // kernel handle has information about whether it is sorted add or not. 
-template -void spadd_symbolic(KernelHandle* handle, const AMatrix& A, const BMatrix& B, - CMatrix& C) { +template +void spadd_symbolic(const ExecSpace &exec, KernelHandle *handle, + const AMatrix &A, const BMatrix &B, CMatrix &C) { using row_map_type = typename CMatrix::row_map_type::non_const_type; using entries_type = typename CMatrix::index_type::non_const_type; using values_type = typename CMatrix::values_type::non_const_type; + auto addHandle = handle->get_spadd_handle(); + // Create the row_map of C, no need to initialize it row_map_type row_mapC( - Kokkos::view_alloc(Kokkos::WithoutInitializing, "row map"), + Kokkos::view_alloc(exec, Kokkos::WithoutInitializing, "row map"), A.numRows() + 1); - KokkosSparse::Experimental::spadd_symbolic(handle, A.graph.row_map, - A.graph.entries, B.graph.row_map, - B.graph.entries, row_mapC); + + // Shortcuts for special cases as they cause errors in some TPL + // implementations (e.g., cusparse and hipsparse) + if (!A.nnz()) { + Kokkos::deep_copy(exec, row_mapC, B.graph.row_map); + addHandle->set_c_nnz(B.graph.entries.extent(0)); + } else if (!B.nnz()) { + Kokkos::deep_copy(exec, row_mapC, A.graph.row_map); + addHandle->set_c_nnz(A.graph.entries.extent(0)); + } else { + KokkosSparse::Experimental::spadd_symbolic( + exec, handle, A.numRows(), A.numCols(), A.graph.row_map, + A.graph.entries, B.graph.row_map, B.graph.entries, row_mapC); + } // Now create and allocate the entries and values // views so we can build a graph and then matrix C // and subsequently construct C. - auto addHandle = handle->get_spadd_handle(); entries_type entriesC( - Kokkos::view_alloc(Kokkos::WithoutInitializing, "entries"), + Kokkos::view_alloc(exec, Kokkos::WithoutInitializing, "entries"), addHandle->get_c_nnz()); // Finally since we already have the number of nnz handy // we can go ahead and allocate C's values and set them. 
- values_type valuesC(Kokkos::view_alloc(Kokkos::WithoutInitializing, "values"), - addHandle->get_c_nnz()); + values_type valuesC( + Kokkos::view_alloc(exec, Kokkos::WithoutInitializing, "values"), + addHandle->get_c_nnz()); C = CMatrix("matrix", A.numRows(), A.numCols(), addHandle->get_c_nnz(), valuesC, row_mapC, entriesC); } -// Symbolic: count entries in each row in C to produce rowmap +// Numeric: fill the column indices and values // kernel handle has information about whether it is sorted add or not. +template +void spadd_numeric(const ExecSpace &exec, KernelHandle *handle, + const AScalar alpha, const AMatrix &A, const BScalar beta, + const BMatrix &B, CMatrix &C) { + if (!A.nnz()) { + Kokkos::deep_copy(exec, C.graph.entries, B.graph.entries); + KokkosBlas::scal(exec, C.values, beta, B.values); + } else if (!B.nnz()) { + Kokkos::deep_copy(exec, C.graph.entries, A.graph.entries); + KokkosBlas::scal(exec, C.values, alpha, A.values); + } else { + KokkosSparse::Experimental::spadd_numeric( + exec, handle, A.numRows(), A.numCols(), A.graph.row_map, + A.graph.entries, A.values, alpha, B.graph.row_map, B.graph.entries, + B.values, beta, C.graph.row_map, C.graph.entries, C.values); + } +} + +// One without an explicit execution space argument +template +void spadd_symbolic(KernelHandle *handle, const AMatrix &A, const BMatrix &B, + CMatrix &C) { + spadd_symbolic(typename AMatrix::execution_space{}, handle, A, B, C); +} + template -void spadd_numeric(KernelHandle* handle, const AScalar alpha, const AMatrix& A, - const BScalar beta, const BMatrix& B, CMatrix& C) { - KokkosSparse::Experimental::spadd_numeric( - handle, A.graph.row_map, A.graph.entries, A.values, alpha, - B.graph.row_map, B.graph.entries, B.values, beta, C.graph.row_map, - C.graph.entries, C.values); +void spadd_numeric(KernelHandle *handle, const AScalar alpha, const AMatrix &A, + const BScalar beta, const BMatrix &B, CMatrix &C) { + spadd_numeric(typename AMatrix::execution_space{}, handle, alpha, A, 
beta, B, + C); } } // namespace KokkosSparse diff --git a/sparse/src/KokkosSparse_spadd_handle.hpp b/sparse/src/KokkosSparse_spadd_handle.hpp index 2902550d6a..760f912c6d 100644 --- a/sparse/src/KokkosSparse_spadd_handle.hpp +++ b/sparse/src/KokkosSparse_spadd_handle.hpp @@ -32,8 +32,46 @@ class SPADDHandle { typedef typename lno_row_view_t_::non_const_value_type size_type; typedef ExecutionSpace execution_space; +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + struct SpaddCusparseData { + size_t nbytes; + void* workspace; + cusparseMatDescr_t descrA, descrB, descrC; + + SpaddCusparseData() + : nbytes(0), + workspace(nullptr), + descrA(nullptr), + descrB(nullptr), + descrC(nullptr) {} + + ~SpaddCusparseData() { + Kokkos::kokkos_free(workspace); + cusparseDestroyMatDescr(descrA); + cusparseDestroyMatDescr(descrB); + cusparseDestroyMatDescr(descrC); + } + }; +#endif + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + struct SpaddRocsparseData { + rocsparse_mat_descr descrA, descrB, descrC; + + SpaddRocsparseData() : descrA(nullptr), descrB(nullptr), descrC(nullptr) {} + + ~SpaddRocsparseData() { + rocsparse_destroy_mat_descr(descrA); + rocsparse_destroy_mat_descr(descrB); + rocsparse_destroy_mat_descr(descrC); + } + }; +#endif + private: - bool input_sorted; + // if both are true, the input matrices are strict CRS + bool input_sorted; // column indices in a row are sorted + bool input_merged; // column indices in a row are unique (i.e., merged) size_type result_nnz_size; @@ -76,11 +114,20 @@ class SPADDHandle { int get_sort_option() { return this->sort_option; } +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + SpaddCusparseData cusparseData; +#endif + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + SpaddRocsparseData rocsparseData; +#endif + /** * \brief Default constructor. 
*/ - SPADDHandle(bool input_is_sorted) + SPADDHandle(bool input_is_sorted, bool input_is_merged = false) : input_sorted(input_is_sorted), + input_merged(input_is_merged), result_nnz_size(0), called_symbolic(false), called_numeric(false) {} @@ -95,6 +142,8 @@ class SPADDHandle { void set_call_numeric(bool call = true) { this->called_numeric = call; } bool is_input_sorted() { return input_sorted; } + bool is_input_merged() { return input_merged; } + bool is_input_strict_crs() { return input_sorted && input_merged; } }; } // namespace KokkosSparse diff --git a/sparse/src/KokkosSparse_spiluk.hpp b/sparse/src/KokkosSparse_spiluk.hpp index 1bf78abe5e..b3644a8709 100644 --- a/sparse/src/KokkosSparse_spiluk.hpp +++ b/sparse/src/KokkosSparse_spiluk.hpp @@ -530,7 +530,6 @@ void spiluk_numeric(KernelHandle* handle, A_entries_i, A_values_i, L_rowmap_i, L_entries_i, L_values_i, U_rowmap_i, U_entries_i, U_values_i); - } // spiluk_numeric template class SPILUKHandle { public: - typedef ExecutionSpace HandleExecSpace; - typedef TemporaryMemorySpace HandleTempMemorySpace; - typedef PersistentMemorySpace HandlePersistentMemorySpace; + using HandleExecSpace = ExecutionSpace; + using HandleTempMemorySpace = TemporaryMemorySpace; + using HandlePersistentMemorySpace = PersistentMemorySpace; - typedef ExecutionSpace execution_space; - typedef HandlePersistentMemorySpace memory_space; + using execution_space = ExecutionSpace; + using memory_space = HandlePersistentMemorySpace; - typedef typename std::remove_const::type size_type; - typedef const size_type const_size_type; + using TeamPolicy = Kokkos::TeamPolicy; + using RangePolicy = Kokkos::RangePolicy; - typedef typename std::remove_const::type nnz_lno_t; - typedef const nnz_lno_t const_nnz_lno_t; + using size_type = typename std::remove_const::type; + using const_size_type = const size_type; - typedef typename std::remove_const::type nnz_scalar_t; - typedef const nnz_scalar_t const_nnz_scalar_t; + using nnz_lno_t = typename 
std::remove_const::type; + using const_nnz_lno_t = const nnz_lno_t; - typedef typename Kokkos::View - nnz_row_view_t; + using nnz_scalar_t = typename std::remove_const::type; + using const_nnz_scalar_t = const nnz_scalar_t; - typedef typename Kokkos::View - nnz_lno_view_t; + using nnz_row_view_t = Kokkos::View; - typedef typename Kokkos::View - nnz_row_view_host_t; + using nnz_lno_view_t = Kokkos::View; - typedef typename Kokkos::View - nnz_lno_view_host_t; + using nnz_value_view_t = + typename Kokkos::View; - typedef typename std::make_signed< - typename nnz_row_view_t::non_const_value_type>::type signed_integral_t; - typedef Kokkos::View - signed_nnz_lno_view_t; + using nnz_row_view_host_t = + typename Kokkos::View; - typedef Kokkos::View - work_view_t; + using nnz_lno_view_host_t = + typename Kokkos::View; + + using signed_integral_t = typename std::make_signed< + typename nnz_row_view_t::non_const_value_type>::type; + using signed_nnz_lno_view_t = + Kokkos::View; + + using work_view_t = Kokkos::View; private: nnz_row_view_t level_list; // level IDs which the rows belong to @@ -95,6 +95,7 @@ class SPILUKHandle { size_type nlevels; size_type nnzL; size_type nnzU; + size_type block_size; size_type level_maxrows; // max. 
number of rows among levels size_type level_maxrowsperchunk; // max.number of rows among chunks among levels @@ -109,7 +110,7 @@ class SPILUKHandle { public: SPILUKHandle(SPILUKAlgorithm choice, const size_type nrows_, const size_type nnzL_, const size_type nnzU_, - bool symbolic_complete_ = false) + const size_type block_size_ = 0, bool symbolic_complete_ = false) : level_list(), level_idx(), level_ptr(), @@ -121,6 +122,7 @@ class SPILUKHandle { nlevels(0), nnzL(nnzL_), nnzU(nnzU_), + block_size(block_size_), level_maxrows(0), level_maxrowsperchunk(0), symbolic_complete(symbolic_complete_), @@ -128,21 +130,28 @@ class SPILUKHandle { team_size(-1), vector_size(-1) {} - void reset_handle(const size_type nrows_, const size_type nnzL_, - const size_type nnzU_) { + void reset_handle( + const size_type nrows_, const size_type nnzL_, const size_type nnzU_, + const size_type block_size_ = Kokkos::ArithTraits::max()) { set_nrows(nrows_); set_num_levels(0); set_nnzL(nnzL_); set_nnzU(nnzU_); + // user likely does not want to reset block size to 0, so set default + // to size_type::max + if (block_size_ != Kokkos::ArithTraits::max()) { + set_block_size(block_size_); + } set_level_maxrows(0); set_level_maxrowsperchunk(0); - level_list = nnz_row_view_t("level_list", nrows_), - level_idx = nnz_lno_view_t("level_idx", nrows_), - level_ptr = nnz_lno_view_t("level_ptr", nrows_ + 1), - hlevel_ptr = nnz_lno_view_host_t("hlevel_ptr", nrows_ + 1), - level_nchunks = nnz_lno_view_host_t(), - level_nrowsperchunk = nnz_lno_view_host_t(), reset_symbolic_complete(), + level_list = nnz_row_view_t("level_list", nrows_); + level_idx = nnz_lno_view_t("level_idx", nrows_); + level_ptr = nnz_lno_view_t("level_ptr", nrows_ + 1); + hlevel_ptr = nnz_lno_view_host_t("hlevel_ptr", nrows_ + 1); + level_nchunks = nnz_lno_view_host_t(); + level_nrowsperchunk = nnz_lno_view_host_t(); iw = work_view_t(); + reset_symbolic_complete(); } virtual ~SPILUKHandle(){}; @@ -205,6 +214,14 @@ class SPILUKHandle { 
KOKKOS_INLINE_FUNCTION void set_nnzU(const size_type nnzU_) { this->nnzU = nnzU_; } + KOKKOS_INLINE_FUNCTION + size_type get_block_size() const { return block_size; } + + KOKKOS_INLINE_FUNCTION + void set_block_size(const size_type block_size_) { + this->block_size = block_size_; + } + KOKKOS_INLINE_FUNCTION size_type get_level_maxrows() const { return level_maxrows; } @@ -223,6 +240,8 @@ class SPILUKHandle { bool is_symbolic_complete() const { return symbolic_complete; } + bool is_block_enabled() const { return block_size > 0; } + size_type get_num_levels() const { return nlevels; } void set_num_levels(size_type nlevels_) { this->nlevels = nlevels_; } @@ -236,9 +255,6 @@ class SPILUKHandle { int get_vector_size() const { return this->vector_size; } void print_algorithm() { - if (algm == SPILUKAlgorithm::SEQLVLSCHD_RP) - std::cout << "SEQLVLSCHD_RP" << std::endl; - if (algm == SPILUKAlgorithm::SEQLVLSCHD_TP1) std::cout << "SEQLVLSCHD_TP1" << std::endl; @@ -249,19 +265,6 @@ class SPILUKHandle { } */ } - - inline SPILUKAlgorithm StringToSPILUKAlgorithm(std::string &name) { - if (name == "SPILUK_DEFAULT") - return SPILUKAlgorithm::SEQLVLSCHD_RP; - else if (name == "SPILUK_RANGEPOLICY") - return SPILUKAlgorithm::SEQLVLSCHD_RP; - else if (name == "SPILUK_TEAMPOLICY1") - return SPILUKAlgorithm::SEQLVLSCHD_TP1; - /*else if(name=="SPILUK_TEAMPOLICY2") return - * SPILUKAlgorithm::SEQLVLSCHED_TP2;*/ - else - throw std::runtime_error("Invalid SPILUKAlgorithm name"); - } }; } // namespace Experimental diff --git a/sparse/src/KokkosSparse_spmv.hpp b/sparse/src/KokkosSparse_spmv.hpp index bd038813d1..2391291695 100644 --- a/sparse/src/KokkosSparse_spmv.hpp +++ b/sparse/src/KokkosSparse_spmv.hpp @@ -22,7 +22,7 @@ #define KOKKOSSPARSE_SPMV_HPP_ #include "KokkosKernels_helpers.hpp" -#include "KokkosKernels_Controls.hpp" +#include "KokkosSparse_spmv_handle.hpp" #include "KokkosSparse_spmv_spec.hpp" #include "KokkosSparse_spmv_struct_spec.hpp" #include 
"KokkosSparse_spmv_bsrmatrix_spec.hpp" @@ -40,816 +40,47 @@ struct RANK_ONE {}; struct RANK_TWO {}; } // namespace -/// \brief Kokkos sparse matrix-vector multiply on single -/// vectors (RANK_ONE tag). Computes y := alpha*Op(A)*x + beta*y, where Op(A) is -/// controlled by mode (see below). -/// -/// \tparam ExecutionSpace A Kokkos execution space. Must be able to access -/// the memory spaces of A, x, and y. -/// \tparam AlphaType Type of coefficient alpha. Must be convertible to -/// YVector::value_type. \tparam AMatrix A KokkosSparse::CrsMatrix, or -/// KokkosSparse::Experimental::BsrMatrix \tparam XVector Type of x, must be a -/// rank-1 Kokkos::View \tparam BetaType Type of coefficient beta. Must be -/// convertible to YVector::value_type. \tparam YVector Type of y, must be a -/// rank-1 Kokkos::View and its rank must match that of XVector -/// -/// \param space [in] The execution space instance on which to run the -/// kernel. -/// \param controls [in] kokkos-kernels control structure. -/// \param mode [in] Select A's operator mode: "N" for normal, "T" for -/// transpose, "C" for conjugate or "H" for conjugate transpose. \param alpha -/// [in] Scalar multiplier for the matrix A. \param A [in] The sparse matrix A. -/// \param x [in] A vector to multiply on the left by A. -/// \param beta [in] Scalar multiplier for the vector y. -/// \param y [in/out] Result vector. -/// \param tag RANK_ONE dispatch -#ifdef DOXY // documentation version - don't separately document SFINAE - // specializations for BSR and CRS -template -#else -template ::value>::type* = nullptr> -#endif -void spmv(const ExecutionSpace& space, - KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y, - [[maybe_unused]] const RANK_ONE& tag) { - - // Make sure that x and y are Views. 
- static_assert(Kokkos::is_view::value, - "KokkosSparse::spmv: XVector must be a Kokkos::View."); - static_assert(Kokkos::is_view::value, - "KokkosSparse::spmv: YVector must be a Kokkos::View."); - // Make sure A, x, y are accessible to ExecutionSpace - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: AMatrix must be accessible from ExecutionSpace"); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: XVector must be accessible from ExecutionSpace"); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: YVector must be accessible from ExecutionSpace"); - -// Make sure that x and y have the same rank. -// Make sure that x (and therefore y) is rank 1. -#if (KOKKOS_VERSION >= 40100) - static_assert(XVector::rank() == YVector::rank(), - "KokkosSparse::spmv: Vector ranks do not match."); - - static_assert(XVector::rank() == 1, - "KokkosSparse::spmv: Both Vector inputs must have rank 1 " - "in order to call this specialization of spmv."); -#else - static_assert( - static_cast(XVector::rank) == static_cast(YVector::rank), - "KokkosSparse::spmv: Vector ranks do not match."); - static_assert(static_cast(XVector::rank) == 1, - "KokkosSparse::spmv: Both Vector inputs must have rank 1 " - "in order to call this specialization of spmv."); -#endif - // Make sure that y is non-const. - static_assert(std::is_same::value, - "KokkosSparse::spmv: Output Vector must be non-const."); - - // Check compatibility of dimensions at run time. 
- if ((mode[0] == NoTranspose[0]) || (mode[0] == Conjugate[0])) { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numCols()) > static_cast(x.extent(0))) || - (static_cast(A.numRows()) > static_cast(y.extent(0)))) { - std::ostringstream os; - os << "KokkosSparse::spmv: Dimensions do not match: " - << ", A: " << A.numRows() << " x " << A.numCols() - << ", x: " << x.extent(0) << " x " << x.extent(1) - << ", y: " << y.extent(0) << " x " << y.extent(1); - KokkosKernels::Impl::throw_runtime_exception(os.str()); - } - } else { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numCols()) > static_cast(y.extent(0))) || - (static_cast(A.numRows()) > static_cast(x.extent(0)))) { - std::ostringstream os; - os << "KokkosSparse::spmv: Dimensions do not match (transpose): " - << ", A: " << A.numRows() << " x " << A.numCols() - << ", x: " << x.extent(0) << " x " << x.extent(1) - << ", y: " << y.extent(0) << " x " << y.extent(1); - KokkosKernels::Impl::throw_runtime_exception(os.str()); - } - } - - typedef KokkosSparse::CrsMatrix< - typename AMatrix::const_value_type, typename AMatrix::const_ordinal_type, - typename AMatrix::device_type, Kokkos::MemoryTraits, - typename AMatrix::const_size_type> - AMatrix_Internal; - - typedef Kokkos::View< - typename XVector::const_value_type*, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename XVector::device_type, - Kokkos::MemoryTraits > - XVector_Internal; - - typedef Kokkos::View< - typename YVector::non_const_value_type*, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename YVector::device_type, Kokkos::MemoryTraits > - YVector_Internal; - - AMatrix_Internal A_i = A; - XVector_Internal x_i = x; - YVector_Internal y_i = y; - - if (alpha == Kokkos::ArithTraits::zero() || A_i.numRows() == 0 || - A_i.numCols() == 0 || A_i.nnz() == 0) { - // This is required to maintain semantics of KokkosKernels native SpMV: - // if y contains NaN but beta = 0, the result y should be filled with 0. 
- // For example, this is useful for passing in uninitialized y and beta=0. - if (beta == Kokkos::ArithTraits::zero()) - Kokkos::deep_copy(space, y_i, Kokkos::ArithTraits::zero()); - else - KokkosBlas::scal(space, y_i, beta, y_i); - return; - } - - // Whether to call KokkosKernel's native implementation, even if a TPL impl is - // available - bool useFallback = controls.isParameter("algorithm") && - (controls.getParameter("algorithm") != "tpl"); - -#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE - // cuSPARSE does not support the conjugate mode (C) - if constexpr (std::is_same_v || - std::is_same_v) { - useFallback = useFallback || (mode[0] == Conjugate[0]); - } - // cuSPARSE 12 requires that the output (y) vector is 16-byte aligned for all - // scalar types -#if defined(CUSPARSE_VER_MAJOR) && (CUSPARSE_VER_MAJOR == 12) - uintptr_t yptr = uintptr_t((void*)y.data()); - if (yptr % 16 != 0) useFallback = true; -#endif -#endif - -#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE - if (std::is_same::value) { - useFallback = useFallback || (mode[0] != NoTranspose[0]); - } -#endif - -#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL - if (std::is_same_v) { - useFallback = useFallback || (mode[0] == Conjugate[0]); - } -#ifdef KOKKOS_ENABLE_SYCL - if (std::is_same_v) { - useFallback = useFallback || (mode[0] == Conjugate[0]); - } -#endif -#endif - - if (useFallback) { - // Explicitly call the non-TPL SPMV implementation - std::string label = - "KokkosSparse::spmv[NATIVE," + - Kokkos::ArithTraits< - typename AMatrix_Internal::non_const_value_type>::name() + - "]"; - Kokkos::Profiling::pushRegion(label); - Impl::SPMV::spmv(space, controls, mode, alpha, A_i, - x_i, beta, y_i); - Kokkos::Profiling::popRegion(); - } else { - // note: the cuSPARSE spmv wrapper defines a profiling region, so one is not - // needed here. - Impl::SPMV::spmv(space, controls, mode, alpha, A_i, x_i, - beta, y_i); - } -} - -/// \brief Kokkos sparse matrix-vector multiply on single -/// vector (RANK_ONE tag). 
Computes y := alpha*Op(A)*x + beta*y, where Op(A) is -/// controlled by mode (see below). -/// -/// \tparam AlphaType Type of coefficient alpha. Must be convertible to -/// YVector::value_type. \tparam AMatrix A KokkosSparse::CrsMatrix, or -/// KokkosSparse::Experimental::BsrMatrix \tparam XVector Type of x, must be a -/// rank-1 Kokkos::View \tparam BetaType Type of coefficient beta. Must be -/// convertible to YVector::value_type. \tparam YVector Type of y, must be a -/// rank-1 Kokkos::View and its rank must match that of XVector -/// -/// \param controls [in] kokkos-kernels control structure. -/// \param mode [in] Select A's operator mode: "N" for normal, "T" for -/// transpose, "C" for conjugate or "H" for conjugate transpose. \param alpha -/// [in] Scalar multiplier for the matrix A. \param A [in] The sparse matrix A. -/// \param x [in] A vector to multiply on the left by A. -/// \param beta [in] Scalar multiplier for the vector y. -/// \param y [in/out] Result vector. -/// \param tag RANK_ONE dispatch -#ifdef DOXY // documentation version -template -#else -template ::value>::type* = nullptr> -#endif -void spmv(KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y, const RANK_ONE& tag) { - spmv(typename AMatrix::execution_space{}, controls, mode, alpha, A, x, beta, - y, tag); -} - -#ifndef DOXY // hide SFINAE specialization for BSR -template ::value>::type* = nullptr> -void spmv(const ExecutionSpace& space, - KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y, - [[maybe_unused]] const RANK_ONE& tag) { - // Make sure that x and y are Views. 
- static_assert(Kokkos::is_view::value, - "KokkosSparse::spmv: XVector must be a Kokkos::View."); - static_assert(Kokkos::is_view::value, - "KokkosSparse::spmv: YVector must be a Kokkos::View."); - // Make sure A, x, y are accessible to ExecutionSpace - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: AMatrix must be accessible from ExecutionSpace"); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: XVector must be accessible from ExecutionSpace"); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: YVector must be accessible from ExecutionSpace"); - // Make sure that x and y have the same rank. -#if (KOKKOS_VERSION >= 40100) - static_assert(XVector::rank() == YVector::rank(), - "KokkosSparse::spmv: Vector ranks do not match."); -#else - static_assert( - static_cast(XVector::rank) == static_cast(YVector::rank), - "KokkosSparse::spmv: Vector ranks do not match."); -#endif - // Make sure that x (and therefore y) is rank 1. - static_assert(static_cast(XVector::rank) == 1, - "KokkosSparse::spmv: Both Vector inputs must have rank 1 " - "in order to call this specialization of spmv."); - // Make sure that y is non-const. - static_assert(std::is_same::value, - "KokkosSparse::spmv: Output Vector must be non-const."); - - // - if (A.blockDim() == 1) { - KokkosSparse::CrsMatrix< - typename AMatrix::value_type, typename AMatrix::ordinal_type, - typename AMatrix::device_type, Kokkos::MemoryTraits, - typename AMatrix::size_type> - Acrs("bsr_to_crs", A.numCols(), A.values, A.graph); - KokkosSparse::spmv(space, controls, mode, alpha, Acrs, x, beta, y, - RANK_ONE()); - return; - } - // Check compatibility of dimensions at run time. 
- if ((mode[0] == NoTranspose[0]) || (mode[0] == Conjugate[0])) { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numCols() * A.blockDim()) != - static_cast(x.extent(0))) || - (static_cast(A.numRows() * A.blockDim()) != - static_cast(y.extent(0)))) { - std::ostringstream os; - os << "KokkosSparse::spmv (BsrMatrix): Dimensions do not match: " - << ", A: " << A.numRows() * A.blockDim() << " x " - << A.numCols() * A.blockDim() << ", x: " << x.extent(0) << " x " - << x.extent(1) << ", y: " << y.extent(0) << " x " << y.extent(1); - - KokkosKernels::Impl::throw_runtime_exception(os.str()); - } - } else { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numCols() * A.blockDim()) != - static_cast(y.extent(0))) || - (static_cast(A.numRows() * A.blockDim()) != - static_cast(x.extent(0)))) { - std::ostringstream os; - os << "KokkosSparse::spmv (BsrMatrix): Dimensions do not match " - "(transpose): " - << ", A: " << A.numRows() * A.blockDim() << " x " - << A.numCols() * A.blockDim() << ", x: " << x.extent(0) << " x " - << x.extent(1) << ", y: " << y.extent(0) << " x " << y.extent(1); - - KokkosKernels::Impl::throw_runtime_exception(os.str()); - } - } - // - typedef KokkosSparse::Experimental::BsrMatrix< - typename AMatrix::const_value_type, typename AMatrix::const_ordinal_type, - typename AMatrix::device_type, Kokkos::MemoryTraits, - typename AMatrix::const_size_type> - AMatrix_Internal; - - typedef Kokkos::View< - typename XVector::const_value_type*, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename XVector::device_type, - Kokkos::MemoryTraits > - XVector_Internal; - - typedef Kokkos::View< - typename YVector::non_const_value_type*, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename YVector::device_type, Kokkos::MemoryTraits > - YVector_Internal; - - AMatrix_Internal A_i(A); - XVector_Internal x_i(x); - YVector_Internal y_i(y); - - if (alpha == Kokkos::ArithTraits::zero() || A_i.numRows() == 0 || - A_i.numCols() == 
0 || A_i.nnz() == 0) { - // This is required to maintain semantics of KokkosKernels native SpMV: - // if y contains NaN but beta = 0, the result y should be filled with 0. - // For example, this is useful for passing in uninitialized y and beta=0. - if (beta == Kokkos::ArithTraits::zero()) - Kokkos::deep_copy(space, y_i, Kokkos::ArithTraits::zero()); - else - KokkosBlas::scal(space, y_i, beta, y_i); - return; - } - - // - // Whether to call KokkosKernel's native implementation, even if a TPL impl is - // available - bool useFallback = controls.isParameter("algorithm") && - (controls.getParameter("algorithm") != "tpl"); - -#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE - // cuSPARSE does not support the modes (C), (T), (H) - if (std::is_same::value || - std::is_same::value) { - useFallback = useFallback || (mode[0] != NoTranspose[0]); - } -#endif - -#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL - if (std::is_same::value) { - useFallback = useFallback || (mode[0] == Conjugate[0]); - } -#endif - -#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE - // rocSparse does not support the modes (C), (T), (H) - if constexpr (std::is_same_v) { - useFallback = useFallback || (mode[0] != NoTranspose[0]); - } -#endif - - if (useFallback) { - // Explicitly call the non-TPL SPMV_BSRMATRIX implementation - std::string label = - "KokkosSparse::spmv[NATIVE,BSRMATRIX," + - Kokkos::ArithTraits< - typename AMatrix_Internal::non_const_value_type>::name() + - "]"; - Kokkos::Profiling::pushRegion(label); - Experimental::Impl::SPMV_BSRMATRIX::spmv_bsrmatrix(space, controls, - mode, alpha, A_i, - x_i, beta, y_i); - Kokkos::Profiling::popRegion(); - } else { - constexpr bool tpl_spec_avail = - KokkosSparse::Experimental::Impl::spmv_bsrmatrix_tpl_spec_avail< - ExecutionSpace, AMatrix_Internal, XVector_Internal, - YVector_Internal>::value; - - constexpr bool eti_spec_avail = - tpl_spec_avail - ? 
KOKKOSKERNELS_IMPL_COMPILE_LIBRARY /* force FALSE in app/test */ - : KokkosSparse::Experimental::Impl::spmv_bsrmatrix_eti_spec_avail< - ExecutionSpace, AMatrix_Internal, XVector_Internal, - YVector_Internal>::value; - - Experimental::Impl::SPMV_BSRMATRIX< - ExecutionSpace, AMatrix_Internal, XVector_Internal, YVector_Internal, - tpl_spec_avail, eti_spec_avail>::spmv_bsrmatrix(space, controls, mode, - alpha, A_i, x_i, beta, - y_i); - } -} - -template ::value>::type* = nullptr> -void spmv(KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y, const RANK_ONE& tag) { - spmv(typename AMatrix::execution_space{}, controls, mode, alpha, A, x, beta, - y, tag); -} -#endif // ifndef DOXY - -namespace Impl { -template -struct SPMV2D1D { - static bool spmv2d1d(const char mode[], const AlphaType& alpha, - const AMatrix& A, const XVector& x, const BetaType& beta, - const YVector& y); - - template - static bool spmv2d1d(const ExecutionSpace& space, const char mode[], - const AlphaType& alpha, const AMatrix& A, - const XVector& x, const BetaType& beta, - const YVector& y); -}; - -#if defined(KOKKOSKERNELS_INST_LAYOUTSTRIDE) || !defined(KOKKOSKERNELS_ETI_ONLY) -template -struct SPMV2D1D { - static bool spmv2d1d(const char mode[], const AlphaType& alpha, - const AMatrix& A, const XVector& x, const BetaType& beta, - const YVector& y) { - spmv(typename AMatrix::execution_space{}, mode, alpha, A, x, beta, y); - return true; - } - - template - static bool spmv2d1d(const ExecutionSpace& space, const char mode[], - const AlphaType& alpha, const AMatrix& A, - const XVector& x, const BetaType& beta, - const YVector& y) { - spmv(space, mode, alpha, A, x, beta, y); - return true; - } -}; - -#else - -template -struct SPMV2D1D { - static bool spmv2d1d(const char /*mode*/[], const AlphaType& /*alpha*/, - const AMatrix& /*A*/, const XVector& /*x*/, - const BetaType& /*beta*/, const 
YVector& /*y*/) { - return false; - } - - template - static bool spmv2d1d(const ExecutionSpace& /* space */, const char /*mode*/[], - const AlphaType& /*alpha*/, const AMatrix& /*A*/, - const XVector& /*x*/, const BetaType& /*beta*/, - const YVector& /*y*/) { - return false; - } -}; -#endif - -#if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || !defined(KOKKOSKERNELS_ETI_ONLY) -template -struct SPMV2D1D { - static bool spmv2d1d(const char mode[], const AlphaType& alpha, - const AMatrix& A, const XVector& x, const BetaType& beta, - const YVector& y) { - spmv(typename AMatrix::execution_space{}, mode, alpha, A, x, beta, y); - return true; - } - - template - static bool spmv2d1d(const ExecutionSpace& space, const char mode[], - const AlphaType& alpha, const AMatrix& A, - const XVector& x, const BetaType& beta, - const YVector& y) { - spmv(space, mode, alpha, A, x, beta, y); - return true; - } -}; - -#else - -template -struct SPMV2D1D { - static bool spmv2d1d(const char /*mode*/[], const AlphaType& /*alpha*/, - const AMatrix& /*A*/, const XVector& /*x*/, - const BetaType& /*beta*/, const YVector& /*y*/) { - return false; - } - - template - static bool spmv2d1d(const ExecutionSpace& /* space */, const char /*mode*/[], - const AlphaType& /*alpha*/, const AMatrix& /*A*/, - const XVector& /*x*/, const BetaType& /*beta*/, - const YVector& /*y*/) { - return false; - } -}; -#endif - -#if defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) || !defined(KOKKOSKERNELS_ETI_ONLY) -template -struct SPMV2D1D { - static bool spmv2d1d(const char mode[], const AlphaType& alpha, - const AMatrix& A, const XVector& x, const BetaType& beta, - const YVector& y) { - spmv(typename AMatrix::execution_space{}, mode, alpha, A, x, beta, y); - return true; - } - - template - static bool spmv2d1d(const ExecutionSpace& space, const char mode[], - const AlphaType& alpha, const AMatrix& A, - const XVector& x, const BetaType& beta, - const YVector& y) { - spmv(space, mode, alpha, A, x, beta, y); - return true; - } -}; - 
-#else - -template -struct SPMV2D1D { - static bool spmv2d1d(const char /*mode*/[], const AlphaType& /*alpha*/, - const AMatrix& /*A*/, const XVector& /*x*/, - const BetaType& /*beta*/, const YVector& /*y*/) { - return false; - } - - template - static bool spmv2d1d(const ExecutionSpace& /* space */, const char /*mode*/[], - const AlphaType& /*alpha*/, const AMatrix& /*A*/, - const XVector& /*x*/, const BetaType& /*beta*/, - const YVector& /*y*/) { - return false; - } -}; -#endif -} // namespace Impl - -template -using SPMV2D1D - [[deprecated("KokkosSparse::SPMV2D1D is not part of the public interface - " - "use KokkosSparse::spmv instead")]] = - Impl::SPMV2D1D; - -/// \brief Kokkos sparse matrix-vector multiply on multivectors -/// (RANK_TWO tag). Computes y := alpha*Op(A)*x + beta*y, where Op(A) is +// clang-format off +/// \brief Kokkos sparse matrix-vector multiply. +/// Computes y := alpha*Op(A)*x + beta*y, where Op(A) is /// controlled by mode (see below). /// /// \tparam ExecutionSpace A Kokkos execution space. Must be able to access -/// the memory spaces of A, x, and y. +/// the memory spaces of A, x, and y. Must match Handle::ExecutionSpaceType. +/// \tparam Handle Specialization of KokkosSparse::SPMVHandle /// \tparam AlphaType Type of coefficient alpha. Must be convertible to -/// YVector::value_type. \tparam AMatrix A KokkosSparse::CrsMatrix, or -/// KokkosSparse::Experimental::BsrMatrix \tparam XVector Type of x, must be a -/// rank-2 Kokkos::View \tparam BetaType Type of coefficient beta. Must be -/// convertible to YVector::value_type. \tparam YVector Type of y, must be a -/// rank-2 Kokkos::View and its rank must match that of XVector +/// YVector::value_type. +/// \tparam AMatrix A KokkosSparse::CrsMatrix, or +/// KokkosSparse::Experimental::BsrMatrix. Must be identical to Handle::AMatrixType. +/// \tparam XVector Type of x, must be a rank-1 or 2 Kokkos::View. Must be identical to Handle::XVectorType. +/// \tparam BetaType Type of coefficient beta. 
Must be +/// convertible to YVector::value_type. +/// \tparam YVector Type of y, must be a rank-1 or 2 Kokkos::View and its rank must match that of XVector. Must +/// be identical to Handle::YVectorType. /// /// \param space [in] The execution space instance on which to run the /// kernel. -/// \param controls [in] kokkos-kernels control structure. +/// \param handle [in/out] a pointer to a KokkosSparse::SPMVHandle. On the first call to spmv with +/// a given handle instance, the handle's internal data will be initialized automatically. +/// On all later calls to spmv, this internal data will be reused. /// \param mode [in] Select A's operator mode: "N" for normal, "T" for -/// transpose, "C" for conjugate or "H" for conjugate transpose. \param alpha -/// [in] Scalar multiplier for the matrix A. \param A [in] The sparse matrix A. -/// \param x [in] A vector to multiply on the left by A. -/// \param beta [in] Scalar multiplier for the vector y. -/// \param y [in/out] Result vector. -/// \param tag RANK_TWO dispatch -#ifdef DOXY // documentation version -template -#else -template ::value>::type* = nullptr> -#endif -void spmv(const ExecutionSpace& space, - KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y, - [[maybe_unused]] const RANK_TWO& tag) { - // Make sure that x and y are Views. 
- static_assert(Kokkos::is_view::value, - "KokkosSparse::spmv: XVector must be a Kokkos::View."); - static_assert(Kokkos::is_view::value, - "KokkosSparse::spmv: YVector must be a Kokkos::View."); - // Make sure A, x, y are accessible to ExecutionSpace - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: AMatrix must be accessible from ExecutionSpace"); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: XVector must be accessible from ExecutionSpace"); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: YVector must be accessible from ExecutionSpace"); -// Make sure that x and y have the same rank. -#if (KOKKOS_VERSION >= 40100) - static_assert(XVector::rank() == YVector::rank(), - "KokkosSparse::spmv: Vector ranks do not match."); -#else - static_assert( - static_cast(XVector::rank) == static_cast(YVector::rank), - "KokkosSparse::spmv: Vector ranks do not match."); -#endif - // Make sure that x (and therefore y) is rank 2. - static_assert(static_cast(XVector::rank) == 2, - "KokkosSparse::spmv: Both Vector inputs must have rank 2 " - "in order to call this specialization of spmv."); - // Make sure that y is non-const. - static_assert(std::is_same::value, - "KokkosSparse::spmv: Output Vector must be non-const."); - - // Check compatibility of dimensions at run time. 
- if ((mode[0] == NoTranspose[0]) || (mode[0] == Conjugate[0])) { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numCols()) > static_cast(x.extent(0))) || - (static_cast(A.numRows()) > static_cast(y.extent(0)))) { - std::ostringstream os; - os << "KokkosBlas::spmv: Dimensions do not match: " - << ", A: " << A.numRows() << " x " << A.numCols() - << ", x: " << x.extent(0) << " x " << x.extent(1) - << ", y: " << y.extent(0) << " x " << y.extent(1); - KokkosKernels::Impl::throw_runtime_exception(os.str()); - } - } else { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numCols()) > static_cast(y.extent(0))) || - (static_cast(A.numRows()) > static_cast(x.extent(0)))) { - std::ostringstream os; - os << "KokkosBlas::spmv: Dimensions do not match (transpose): " - << ", A: " << A.numRows() << " x " << A.numCols() - << ", x: " << x.extent(0) << " x " << x.extent(1) - << ", y: " << y.extent(0) << " x " << y.extent(1); - KokkosKernels::Impl::throw_runtime_exception(os.str()); - } - } - - typedef KokkosSparse::CrsMatrix< - typename AMatrix::const_value_type, typename AMatrix::const_ordinal_type, - typename AMatrix::device_type, Kokkos::MemoryTraits, - typename AMatrix::const_size_type> - AMatrix_Internal; - - AMatrix_Internal A_i = A; - - // Call single-vector version if appropriate - if (x.extent(1) == 1) { - typedef Kokkos::View< - typename XVector::const_value_type*, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename XVector::device_type, - Kokkos::MemoryTraits > - XVector_SubInternal; - typedef Kokkos::View< - typename YVector::non_const_value_type*, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename YVector::device_type, Kokkos::MemoryTraits > - YVector_SubInternal; - - XVector_SubInternal x_i = Kokkos::subview(x, Kokkos::ALL(), 0); - YVector_SubInternal y_i = Kokkos::subview(y, Kokkos::ALL(), 0); - - // spmv (mode, alpha, A, x_i, beta, y_i); - using impl_type = - Impl::SPMV2D1D; - if (impl_type::spmv2d1d(space, 
mode, alpha, A, x_i, beta, y_i)) { - return; - } - } - { - typedef Kokkos::View< - typename XVector::const_value_type**, typename XVector::array_layout, - typename XVector::device_type, - Kokkos::MemoryTraits > - XVector_Internal; - - typedef Kokkos::View > - YVector_Internal; - - XVector_Internal x_i = x; - YVector_Internal y_i = y; - - bool useNative = false; - -// cusparseSpMM does not support conjugate mode -#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE - useNative = useNative || (Conjugate[0] == mode[0]); -#endif - useNative = useNative || (controls.isParameter("algorithm") && - (controls.getParameter("algorithm") != "tpl")); - - if (useNative) { - return Impl::SPMV_MV< - ExecutionSpace, AMatrix_Internal, XVector_Internal, YVector_Internal, - std::is_integral::value, - false>::spmv_mv(space, controls, mode, alpha, A_i, x_i, beta, y_i); - } else { - return Impl::SPMV_MV::spmv_mv(space, controls, mode, - alpha, A_i, x_i, beta, - y_i); - } - } -} - -/// \brief Kokkos sparse matrix-vector multiply on multivectors -/// (RANK_TWO tag). Computes y := alpha*Op(A)*x + beta*y, where Op(A) is -/// controlled by mode (see below). -/// -/// \tparam AlphaType Type of coefficient alpha. Must be convertible to -/// YVector::value_type. \tparam AMatrix A KokkosSparse::CrsMatrix, or -/// KokkosSparse::Experimental::BsrMatrix \tparam XVector Type of x, must be a -/// rank-2 Kokkos::View \tparam BetaType Type of coefficient beta. Must be -/// convertible to YVector::value_type. \tparam YVector Type of y, must be a -/// rank-2 Kokkos::View and its rank must match that of XVector -/// -/// \param controls [in] kokkos-kernels control structure. -/// \param mode [in] Select A's operator mode: "N" for normal, "T" for -/// transpose, "C" for conjugate or "H" for conjugate transpose. \param alpha -/// [in] Scalar multiplier for the matrix A. \param A [in] The sparse matrix A. -/// \param x [in] A vector to multiply on the left by A. -/// \param beta [in] Scalar multiplier for the vector y. 
-/// \param y [in/out] Result vector. -/// \param tag RANK_TWO dispatch -#ifdef DOXY -template -#else -template ::value>::type* = nullptr> -#endif -void spmv(KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y, const RANK_TWO& tag) { - spmv(typename AMatrix::execution_space{}, controls, mode, alpha, A, x, beta, - y, tag); -} - -#ifndef DOXY // hide SFINAE -template ::value>::type* = nullptr> -void spmv(const ExecutionSpace& space, - KokkosKernels::Experimental::Controls controls, const char mode[], +/// transpose, "C" for conjugate or "H" for conjugate transpose. +/// \param alpha [in] Scalar multiplier for the matrix A. +/// \param A [in] The sparse matrix A. If handle has previously been passed to spmv, A must be identical to the +/// A passed in to that first call. +/// \param x [in] A vector to multiply on the left by A. +/// \param beta [in] Scalar multiplier for the vector y. +/// \param y [in/out] Result vector. +// clang-format on +template +void spmv(const ExecutionSpace& space, Handle* handle, const char mode[], const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y, - [[maybe_unused]] const RANK_TWO& tag) { + const BetaType& beta, const YVector& y) { + // Make sure A is a CrsMatrix or BsrMatrix. + static_assert( + is_crs_matrix_v || Experimental::is_bsr_matrix_v, + "KokkosSparse::spmv: AMatrix must be a CrsMatrix or BsrMatrix"); // Make sure that x and y are Views. 
static_assert(Kokkos::is_view::value, "KokkosSparse::spmv: XVector must be a Kokkos::View."); @@ -859,459 +90,449 @@ void spmv(const ExecutionSpace& space, static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: AMatrix must be accessible from ExecutionSpace"); + "KokkosSparse::spmv: AMatrix must be accessible from ExecutionSpace"); static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: XVector must be accessible from ExecutionSpace"); + "KokkosSparse::spmv: XVector must be accessible from ExecutionSpace"); static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: YVector must be accessible from ExecutionSpace"); + "KokkosSparse::spmv: YVector must be accessible from ExecutionSpace"); // Make sure that x and y have the same rank. - static_assert( - static_cast(XVector::rank) == static_cast(YVector::rank), - "KokkosSparse::spmv: Vector ranks do not match."); - // Make sure that x (and therefore y) is rank 2. - static_assert(static_cast(XVector::rank) == 2, - "KokkosSparse::spmv: Both Vector inputs must have rank 2 " - "in order to call this specialization of spmv."); + static_assert(XVector::rank() == YVector::rank(), + "KokkosSparse::spmv: Vector ranks do not match."); + // Make sure that x (and therefore y) is rank 1 or 2. + static_assert(XVector::rank() == size_t(1) || XVector::rank() == size_t(2), + "KokkosSparse::spmv: Both Vector inputs must have rank 1 or 2"); // Make sure that y is non-const. 
- static_assert(std::is_same::value, + static_assert(!std::is_const_v, "KokkosSparse::spmv: Output Vector must be non-const."); - // - if (A.blockDim() == 1) { - KokkosSparse::CrsMatrix< - typename AMatrix::value_type, typename AMatrix::ordinal_type, - typename AMatrix::device_type, Kokkos::MemoryTraits, - typename AMatrix::size_type> - Acrs("bsr_to_crs", A.numCols(), A.values, A.graph); - KokkosSparse::spmv(space, controls, mode, alpha, Acrs, x, beta, y, - RANK_TWO()); - return; + // Check that A, X, Y types match that of the Handle + // But only check this if Handle is the user-facing type (SPMVHandle). + // We may internally call spmv with SPMVHandleImpl, which does not include + // the matrix and vector types. + if constexpr (KokkosSparse::Impl::is_spmv_handle_v) { + static_assert( + std::is_same_v, + "KokkosSparse::spmv: AMatrix must be identical to Handle::AMatrixType"); + static_assert( + std::is_same_v, + "KokkosSparse::spmv: XVector must be identical to Handle::XVectorType"); + static_assert( + std::is_same_v, + "KokkosSparse::spmv: YVector must be identical to Handle::YVectorType"); } + + constexpr bool isBSR = Experimental::is_bsr_matrix_v; + // Check compatibility of dimensions at run time. 
+ size_t m, n; + + if constexpr (!isBSR) { + m = A.numRows(); + n = A.numCols(); + } else { + m = A.numRows() * A.blockDim(); + n = A.numCols() * A.blockDim(); + } + if ((mode[0] == NoTranspose[0]) || (mode[0] == Conjugate[0])) { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numCols() * A.blockDim()) != - static_cast(x.extent(0))) || - (static_cast(A.numRows() * A.blockDim()) != - static_cast(y.extent(0)))) { + if ((x.extent(1) != y.extent(1)) || (n != x.extent(0)) || + (m != y.extent(0))) { std::ostringstream os; - os << "KokkosSparse::spmv (BsrMatrix): Dimensions do not match: " - << ", A: " << A.numRows() * A.blockDim() << " x " - << A.numCols() * A.blockDim() << ", x: " << x.extent(0) << " x " + os << "KokkosSparse::spmv: Dimensions do not match: " + << ", A: " << m << " x " << n << ", x: " << x.extent(0) << " x " << x.extent(1) << ", y: " << y.extent(0) << " x " << y.extent(1); - KokkosKernels::Impl::throw_runtime_exception(os.str()); } } else { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numCols() * A.blockDim()) != - static_cast(y.extent(0))) || - (static_cast(A.numRows() * A.blockDim()) != - static_cast(x.extent(0)))) { + if ((x.extent(1) != y.extent(1)) || (m != x.extent(0)) || + (n != y.extent(0))) { std::ostringstream os; - os << "KokkosSparse::spmv (BsrMatrix): Dimensions do not match " - "(transpose): " - << ", A: " << A.numRows() * A.blockDim() << " x " - << A.numCols() * A.blockDim() << ", x: " << x.extent(0) << " x " - << x.extent(1) << ", y: " << y.extent(0) << " x " << y.extent(1); - + os << "KokkosSparse::spmv: Dimensions do not match (transpose): " + << ", A: " << m << " x " << n + << ", x: " << x.extent(0) << " x " << x.extent(1) + << ", y: " << y.extent(0) << " x " << y.extent(1); KokkosKernels::Impl::throw_runtime_exception(os.str()); } } - // - typedef KokkosSparse::Experimental::BsrMatrix< - typename AMatrix::const_value_type, typename AMatrix::const_ordinal_type, - typename AMatrix::device_type, 
Kokkos::MemoryTraits, - typename AMatrix::const_size_type> - AMatrix_Internal; - AMatrix_Internal A_i(A); - - typedef Kokkos::View< - typename XVector::const_value_type**, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename XVector::device_type, - Kokkos::MemoryTraits > - XVector_Internal; - XVector_Internal x_i(x); - typedef Kokkos::View< - typename YVector::non_const_value_type**, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename YVector::device_type, Kokkos::MemoryTraits > - YVector_Internal; - YVector_Internal y_i(y); - // - if (alpha == Kokkos::ArithTraits::zero() || A_i.numRows() == 0 || - A_i.numCols() == 0 || A_i.nnz() == 0) { + // Efficiently handle cases where alpha*Op(A) is equivalent to the zero matrix + if (alpha == Kokkos::ArithTraits::zero() || m == 0 || n == 0 || + A.nnz() == 0) { // This is required to maintain semantics of KokkosKernels native SpMV: // if y contains NaN but beta = 0, the result y should be filled with 0. // For example, this is useful for passing in uninitialized y and beta=0. 
if (beta == Kokkos::ArithTraits::zero()) - Kokkos::deep_copy(space, y_i, Kokkos::ArithTraits::zero()); + Kokkos::deep_copy(space, y, Kokkos::ArithTraits::zero()); else - KokkosBlas::scal(space, y_i, beta, y_i); + KokkosBlas::scal(space, y, beta, y); return; } - // - // Call single-vector version if appropriate - // - if (x.extent(1) == 1) { - typedef Kokkos::View< - typename XVector::const_value_type*, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename XVector::device_type, - Kokkos::MemoryTraits > - XVector_SubInternal; - typedef Kokkos::View< - typename YVector::non_const_value_type*, - typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename YVector::device_type, Kokkos::MemoryTraits > - YVector_SubInternal; - XVector_SubInternal x_0 = Kokkos::subview(x_i, Kokkos::ALL(), 0); - YVector_SubInternal y_0 = Kokkos::subview(y_i, Kokkos::ALL(), 0); + // Get the "impl" parent class of Handle, if it's not already the impl + using HandleImpl = typename Handle::ImplType; - return spmv(space, controls, mode, alpha, A_i, x_0, beta, y_0, RANK_ONE()); - } + using ACrs_Internal = CrsMatrix< + typename AMatrix::const_value_type, typename AMatrix::const_ordinal_type, + typename AMatrix::device_type, Kokkos::MemoryTraits, + typename AMatrix::const_size_type>; + using ABsr_Internal = Experimental::BsrMatrix< + typename AMatrix::const_value_type, typename AMatrix::const_ordinal_type, + typename AMatrix::device_type, Kokkos::MemoryTraits, + typename AMatrix::const_size_type>; + + using AMatrix_Internal = + std::conditional_t; + + // Intercept special case: A is a BsrMatrix with blockDim() == 1 + // This is exactly equivalent to CrsMatrix (more performant) + // and cuSPARSE actually errors out in that case. 
// - // Whether to call KokkosKernel's native implementation, even if a TPL impl is - // available - bool useFallback = controls.isParameter("algorithm") && - (controls.getParameter("algorithm") != "tpl"); + // This relies on the fact that this codepath will always be taken for + // this particular matrix (so internally, this handle is only ever used for + // Crs) + if constexpr (isBSR) { + if (A.blockDim() == 1) { + // Construct an ACrs_Internal (unmanaged memory) from A's views + typename ACrs_Internal::row_map_type rowmap(A.graph.row_map); + typename ACrs_Internal::index_type entries(A.graph.entries); + typename ACrs_Internal::values_type values(A.values); + ACrs_Internal ACrs(std::string{}, A.numRows(), A.numCols(), A.nnz(), + values, rowmap, entries); + spmv(space, handle->get_impl(), mode, alpha, ACrs, x, beta, y); + return; + } + } + + AMatrix_Internal A_i(A); + + // Note: data_type of a View includes both the scalar and rank + using XVector_Internal = Kokkos::View< + typename XVector::const_data_type, + typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, + typename XVector::device_type, + Kokkos::MemoryTraits>; + + using YVector_Internal = Kokkos::View< + typename YVector::non_const_data_type, + typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, + typename YVector::device_type, Kokkos::MemoryTraits>; + + XVector_Internal x_i(x); + YVector_Internal y_i(y); + bool useNative = is_spmv_algorithm_native(handle->get_algorithm()); + // Also use the native algorithm if SPMV_FAST_SETUP was selected and + // rocSPARSE is the possible TPL to use. Native is faster in this case. 
+#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + if (handle->get_algorithm() == SPMV_FAST_SETUP && + std::is_same_v) + useNative = true; +#endif + + // Now call the proper implementation depending on isBSR and the rank of X/Y + if constexpr (!isBSR) { + if constexpr (XVector::rank() == 1) { +///////////////// +// CRS, rank 1 // +///////////////// #ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE - // cuSPARSE does not support the modes (C), (T), (H) - if (std::is_same::value || - std::is_same::value) { - useFallback = useFallback || (mode[0] != NoTranspose[0]); - } + // cuSPARSE does not support the conjugate mode (C) + if constexpr (std::is_same_v || + std::is_same_v) { + useNative = useNative || (mode[0] == Conjugate[0]); + } + // cuSPARSE 12 requires that the output (y) vector is 16-byte aligned for + // all scalar types +#if defined(CUSPARSE_VER_MAJOR) && (CUSPARSE_VER_MAJOR == 12) + uintptr_t yptr = uintptr_t((void*)y.data()); + if (yptr % 16 != 0) useNative = true; +#endif +#endif + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + if (std::is_same::value) { + useNative = useNative || (mode[0] != NoTranspose[0]); + } #endif #ifdef KOKKOSKERNELS_ENABLE_TPL_MKL - if (std::is_same::value) { - useFallback = useFallback || (mode[0] == Conjugate[0]); - } + if (std::is_same_v) { + useNative = useNative || (mode[0] == Conjugate[0]); + } +#ifdef KOKKOS_ENABLE_SYCL + if (std::is_same_v) { + useNative = useNative || (mode[0] == Conjugate[0]); + } +#endif +#endif + if (useNative) { + // Explicitly call the non-TPL SPMV implementation + std::string label = + "KokkosSparse::spmv[NATIVE," + + Kokkos::ArithTraits< + typename AMatrix_Internal::non_const_value_type>::name() + + "]"; + Kokkos::Profiling::pushRegion(label); + Impl::SPMV::spmv(space, + handle, + mode, alpha, + A_i, x_i, + beta, y_i); + Kokkos::Profiling::popRegion(); + } else { + // note: the cuSPARSE spmv wrapper defines a profiling region, so one is + // not needed here. 
+ Impl::SPMV::spmv(space, handle, + mode, alpha, A_i, + x_i, beta, y_i); + } + } else { +///////////////// +// CRS, rank 2 // +///////////////// +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + useNative = useNative || (Conjugate[0] == mode[0]); #endif - if (useFallback) { - // Explicitly call the non-TPL SPMV_BSRMATRIX implementation - std::string label = - "KokkosSparse::spmv[NATIVE,BSMATRIX," + - Kokkos::ArithTraits< - typename AMatrix_Internal::non_const_value_type>::name() + - "]"; - Kokkos::Profiling::pushRegion(label); - Experimental::Impl::SPMV_MV_BSRMATRIX< - ExecutionSpace, AMatrix_Internal, XVector_Internal, YVector_Internal, - std::is_integral::value, - false>::spmv_mv_bsrmatrix(space, controls, mode, alpha, A_i, x_i, beta, - y_i); - Kokkos::Profiling::popRegion(); + if (useNative) { + std::string label = + "KokkosSparse::spmv[NATIVE,MV," + + Kokkos::ArithTraits< + typename AMatrix_Internal::non_const_value_type>::name() + + "]"; + Kokkos::Profiling::pushRegion(label); + Impl::SPMV_MV< + ExecutionSpace, HandleImpl, AMatrix_Internal, XVector_Internal, + YVector_Internal, + std::is_integral::value, + false>::spmv_mv(space, handle, mode, alpha, A_i, x_i, beta, y_i); + Kokkos::Profiling::popRegion(); + } else { + return Impl::SPMV_MV::spmv_mv(space, handle, mode, + alpha, A_i, x_i, beta, + y_i); + } + } } else { - Experimental::Impl::SPMV_MV_BSRMATRIX< - ExecutionSpace, AMatrix_Internal, XVector_Internal, YVector_Internal, - std::is_integral::value>:: - spmv_mv_bsrmatrix(space, controls, mode, alpha, A_i, x_i, beta, y_i); - } -} - -template ::value>::type* = nullptr> -void spmv(KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y, const RANK_TWO& tag) { - spmv(typename AMatrix::execution_space{}, controls, mode, alpha, A, x, beta, - y, tag); -} + if constexpr (XVector::rank() == 1) { +///////////////// +// BSR, rank 1 // +///////////////// 
+#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + // cuSPARSE does not support the modes (C), (T), (H) + if (std::is_same::value || + std::is_same::value) { + useNative = useNative || (mode[0] != NoTranspose[0]); + } #endif -/// \brief Public interface to local sparse matrix-vector multiply. -/// -/// Compute y := beta*y + alpha*Op(A)*x, where x and y are either both -/// rank 1 (single vectors) or rank 2 (multivectors) Kokkos::View -/// instances, and Op(A) is determined -/// by \c mode. If beta == 0, ignore and overwrite the initial -/// entries of y; if alpha == 0, ignore the entries of A and x. -/// -/// If \c AMatrix is a KokkosSparse::Experimental::BsrMatrix, controls may have -/// \c "algorithm" = \c "experimental_bsr_tc" to use Nvidia tensor cores on -/// Volta or Ampere architectures. On Volta-architecture GPUs the only available -/// precision is mixed-precision fp32 accumulator from fp16 inputs. On -/// Ampere-architecture GPUs (cc >= 80), mixed precision is used when A is fp16, -/// x is fp16, and y is fp32. Otherwise, double-precision is used. The caller -/// may override this by setting the \c "tc_precision" = \c "mixed" or -/// \c "double" as desired. -/// -/// For mixed precision, performance will degrade for blockDim < 16. -/// For double precision, for blockDim < 8. -/// For such cases, consider an alternate SpMV algorithm. -/// -/// May have \c "algorithm" set to \c "native" to bypass TPLs if they are -/// enabled for Kokkos::CrsMatrix and Kokkos::Experimental::BsrMatrix on a -/// single vector, or for Kokkos::Experimental::BsrMatrix with a multivector. -/// -/// \tparam ExecutionSpace A Kokkos execution space. Must be able to access -/// the memory spaces of A, x, and y. -/// \tparam AlphaType Type of coefficient alpha. Must be convertible to -/// YVector::value_type. 
\tparam AMatrix A KokkosSparse::CrsMatrix, or -/// KokkosSparse::Experimental::BsrMatrix \tparam XVector Type of x, must be a -/// rank 1 or 2 Kokkos::View \tparam BetaType Type of coefficient beta. Must be -/// convertible to YVector::value_type. \tparam YVector Type of y, must be a -/// rank 1 or 2 Kokkos::View and its rank must match that of XVector -/// -/// \param space [in] The execution space instance on which to run the -/// kernel. -/// \param controls [in] kokkos-kernels control structure -/// \param mode [in] Select A's operator mode: "N" for normal, "T" for -/// transpose, "C" for conjugate or "H" for conjugate transpose. \param alpha -/// [in] Scalar multiplier for the matrix A. \param A [in] The sparse matrix A. -/// \param x [in] Either a single vector (rank-1 Kokkos::View) or -/// multivector (rank-2 Kokkos::View). -/// \param beta [in] Scalar multiplier for the (multi)vector y. -/// \param y [in/out] Either a single vector (rank-1 Kokkos::View) or -/// multivector (rank-2 Kokkos::View). It must have the same number -/// of columns as x. -template -void spmv(const ExecutionSpace& space, - KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y) { - // Make sure that x and y are Views. 
- static_assert(Kokkos::is_view::value, - "KokkosSparse::spmv: XVector must be a Kokkos::View."); - static_assert(Kokkos::is_view::value, - "KokkosSparse::spmv: YVector must be a Kokkos::View."); - // Make sure A, x, y are accessible to ExecutionSpace - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: AMatrix must be accessible from ExecutionSpace"); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: XVector must be accessible from ExecutionSpace"); - static_assert( - Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv: YVector must be accessible from ExecutionSpace"); - // Make sure that both x and y have the same rank. - static_assert( - static_cast(XVector::rank) == static_cast(YVector::rank), - "KokkosSparse::spmv: Vector ranks do not match."); - // Make sure that y is non-const. - static_assert(std::is_same::value, - "KokkosSparse::spmv: Output Vector must be non-const."); - - // Check compatibility of dimensions at run time. 
- if ((mode[0] == NoTranspose[0]) || (mode[0] == Conjugate[0])) { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numPointCols()) != - static_cast(x.extent(0))) || - (static_cast(A.numPointRows()) != - static_cast(y.extent(0)))) { - std::ostringstream os; - os << "KokkosSparse::spmv (Generic): Dimensions do not match: " - << ", A: " << A.numPointRows() << " x " << A.numPointCols() - << ", x: " << x.extent(0) << " x " << x.extent(1) - << ", y: " << y.extent(0) << " x " << y.extent(1); +#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL + if (std::is_same::value) { + useNative = useNative || (mode[0] == Conjugate[0]); + } +#endif - KokkosKernels::Impl::throw_runtime_exception(os.str()); - } - } else { - if ((x.extent(1) != y.extent(1)) || - (static_cast(A.numPointCols()) != - static_cast(y.extent(0))) || - (static_cast(A.numPointRows()) != - static_cast(x.extent(0)))) { - std::ostringstream os; - os << "KokkosSparse::spmv (Generic): Dimensions do not match " - "(transpose): " - << ", A: " << A.numPointRows() << " x " << A.numPointCols() - << ", x: " << x.extent(0) << " x " << x.extent(1) - << ", y: " << y.extent(0) << " x " << y.extent(1); +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + // rocSparse does not support the modes (C), (T), (H) + if constexpr (std::is_same_v) { + useNative = useNative || (mode[0] != NoTranspose[0]); + } +#endif + if (useNative) { + // Explicitly call the non-TPL SPMV_BSRMATRIX implementation + std::string label = + "KokkosSparse::spmv[NATIVE,BSRMATRIX," + + Kokkos::ArithTraits< + typename AMatrix_Internal::non_const_value_type>::name() + + "]"; + Kokkos::Profiling::pushRegion(label); + Impl::SPMV_BSRMATRIX::spmv_bsrmatrix(space, handle, mode, alpha, + A_i, x_i, beta, y_i); + Kokkos::Profiling::popRegion(); + } else { + Impl::SPMV_BSRMATRIX::spmv_bsrmatrix(space, handle, + mode, alpha, A_i, + x_i, beta, y_i); + } + } else { + ///////////////// + // BSR, rank 2 // + ///////////////// +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + // cuSPARSE does not 
support the modes (C), (T), (H) + if (std::is_same::value || + std::is_same::value) { + useNative = useNative || (mode[0] != NoTranspose[0]); + } +#endif - KokkosKernels::Impl::throw_runtime_exception(os.str()); +#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL + if (std::is_same::value) { + useNative = useNative || (mode[0] == Conjugate[0]); + } +#endif + if (useNative) { + // Explicitly call the non-TPL SPMV_MV_BSRMATRIX implementation + std::string label = + "KokkosSparse::spmv[NATIVE,MV,BSRMATRIX," + + Kokkos::ArithTraits< + typename AMatrix_Internal::non_const_value_type>::name() + + "]"; + Kokkos::Profiling::pushRegion(label); + Impl::SPMV_MV_BSRMATRIX< + ExecutionSpace, HandleImpl, AMatrix_Internal, XVector_Internal, + YVector_Internal, + std::is_integral< + typename AMatrix_Internal::const_value_type>::value, + false>::spmv_mv_bsrmatrix(space, handle, mode, alpha, A_i, x_i, + beta, y_i); + Kokkos::Profiling::popRegion(); + } else { + Impl::SPMV_MV_BSRMATRIX< + ExecutionSpace, HandleImpl, AMatrix_Internal, XVector_Internal, + YVector_Internal, + std::is_integral:: + value>::spmv_mv_bsrmatrix(space, handle, mode, alpha, A_i, x_i, + beta, y_i); + } } } - - if (alpha == Kokkos::ArithTraits::zero() || A.numRows() == 0 || - A.numCols() == 0 || A.nnz() == 0) { - // This is required to maintain semantics of KokkosKernels native SpMV: - // if y contains NaN but beta = 0, the result y should be filled with 0. - // For example, this is useful for passing in uninitialized y and beta=0. - if (beta == Kokkos::ArithTraits::zero()) - Kokkos::deep_copy(space, y, Kokkos::ArithTraits::zero()); - else - KokkosBlas::scal(space, y, beta, y); - return; - } - // - using RANK_SPECIALISE = - typename std::conditional(XVector::rank) == 2, RANK_TWO, - RANK_ONE>::type; - spmv(space, controls, mode, alpha, A, x, beta, y, RANK_SPECIALISE()); } -/// \brief Public interface to local sparse matrix-vector multiply. 
-/// -/// Compute y = beta*y + alpha*Op(A)*x, where x and y are either both -/// rank 1 (single vectors) or rank 2 (multivectors) Kokkos::View -/// instances, and Op(A) is determined -/// by \c mode. If beta == 0, ignore and overwrite the initial -/// entries of y; if alpha == 0, ignore the entries of A and x. -/// -/// If \c AMatrix is a KokkosSparse::Experimental::BsrMatrix, controls may have -/// \c "algorithm" = \c "experimental_bsr_tc" to use Nvidia tensor cores on -/// Volta or Ampere architectures. On Volta-architecture GPUs the only available -/// precision is mixed-precision fp32 accumulator from fp16 inputs. On -/// Ampere-architecture GPUs (cc >= 80), mixed precision is used when A is fp16, -/// x is fp16, and y is fp32. Otherwise, double-precision is used. The caller -/// may override this by setting the \c "tc_precision" = \c "mixed" or -/// \c "double" as desired. -/// -/// For mixed precision, performance will degrade for blockDim < 16. -/// For double precision, for blockDim < 8. -/// For such cases, consider an alternate SpMV algorithm. -/// -/// May have \c "algorithm" set to \c "native" to bypass TPLs if they are -/// enabled for Kokkos::CrsMatrix and Kokkos::Experimental::BsrMatrix on a -/// single vector, or for Kokkos::Experimental::BsrMatrix with a multivector. +// clang-format off +/// \brief Kokkos sparse matrix-vector multiply. +/// Computes y := alpha*Op(A)*x + beta*y, where Op(A) is controlled by mode +/// (see below). /// -/// \tparam AMatrix KokkosSparse::CrsMatrix or -/// KokkosSparse::Experimental::BsrMatrix +/// \tparam ExecutionSpace A Kokkos execution space. Must be able to access +/// the memory spaces of A, x, and y. +/// \tparam AlphaType Type of coefficient alpha. Must be convertible to +/// YVector::value_type. +/// \tparam AMatrix A KokkosSparse::CrsMatrix, or KokkosSparse::Experimental::BsrMatrix +/// \tparam XVector Type of x, must be a rank-1 or rank-2 Kokkos::View +/// \tparam BetaType Type of coefficient beta. 
Must be convertible to YVector::value_type. +/// \tparam YVector Type of y, must be a Kokkos::View and its rank must match that of XVector /// -/// \param controls [in] kokkos-kernels control structure -/// \param mode [in] "N" for no transpose, "T" for transpose, or "C" -/// for conjugate transpose. +/// \param space [in] The execution space instance on which to run the kernel. +/// \param mode [in] Select A's operator mode: "N" for normal, "T" for +/// transpose, "C" for conjugate or "H" for conjugate transpose. /// \param alpha [in] Scalar multiplier for the matrix A. /// \param A [in] The sparse matrix A. -/// \param x [in] Either a single vector (rank-1 Kokkos::View) or -/// multivector (rank-2 Kokkos::View). -/// \param beta [in] Scalar multiplier for the (multi)vector y. -/// \param y [in/out] Either a single vector (rank-1 Kokkos::View) or -/// multivector (rank-2 Kokkos::View). It must have the same number -/// of columns as x. -template -void spmv(KokkosKernels::Experimental::Controls controls, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y) { - spmv(typename AMatrix::execution_space{}, controls, mode, alpha, A, x, beta, - y); -} - -#ifndef DOXY -/// \brief Catch-all public interface to error on invalid Kokkos::Sparse spmv -/// argument types -/// -/// This is a catch-all interface that throws a compile-time error if \c -/// AMatrix is not a CrsMatrix, or BsrMatrix -/// -template ::value && - !KokkosSparse::is_crs_matrix::value>::type* = nullptr> -void spmv(KokkosKernels::Experimental::Controls /*controls*/, - const char[] /*mode*/, const AlphaType& /*alpha*/, - const AMatrix& /*A*/, const XVector& /*x*/, const BetaType& /*beta*/, - const YVector& /*y*/) { - // have to arrange this so that the compiler can't tell this is false until - // instantiation - static_assert(KokkosSparse::is_crs_matrix::value || - KokkosSparse::Experimental::is_bsr_matrix::value, - "SpMV: AMatrix must be 
CrsMatrix or BsrMatrix"); -} - -/// \brief Catch-all public interface to error on invalid Kokkos::Sparse spmv -/// argument types -/// -/// This is a catch-all interface that throws a compile-time error if \c -/// AMatrix is not a CrsMatrix, or BsrMatrix -/// +/// \param x [in] A vector to multiply on the left by A. +/// \param beta [in] Scalar multiplier for the vector y. +/// \param y [in/out] Result vector. +// clang-format on template ::value && - !KokkosSparse::is_crs_matrix::value>::type* = nullptr> -void spmv(const ExecutionSpace& /* space */, - KokkosKernels::Experimental::Controls /*controls*/, - const char[] /*mode*/, const AlphaType& /*alpha*/, - const AMatrix& /*A*/, const XVector& /*x*/, const BetaType& /*beta*/, - const YVector& /*y*/) { - // have to arrange this so that the compiler can't tell this is false until - // instantiation - static_assert(KokkosSparse::is_crs_matrix::value || - KokkosSparse::Experimental::is_bsr_matrix::value, - "SpMV: AMatrix must be CrsMatrix or BsrMatrix"); + typename = std::enable_if_t< + Kokkos::is_execution_space::value>> +void spmv(const ExecutionSpace& space, const char mode[], + const AlphaType& alpha, const AMatrix& A, const XVector& x, + const BetaType& beta, const YVector& y) { + SPMVAlgorithm algo = SPMV_FAST_SETUP; + // Without handle reuse, native is overall faster than rocSPARSE +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + if constexpr (std::is_same_v) + algo = SPMV_NATIVE; +#endif + SPMVHandle + handle(algo); + spmv(space, &handle, mode, alpha, A, x, beta, y); } -#endif // ifndef DOXY +// clang-format off /// \brief Kokkos sparse matrix-vector multiply. /// Computes y := alpha*Op(A)*x + beta*y, where Op(A) is controlled by mode /// (see below). /// +/// \tparam Handle Specialization of KokkosSparse::SPMVHandle /// \tparam AlphaType Type of coefficient alpha. Must be convertible to -/// YVector::value_type. 
\tparam AMatrix A KokkosSparse::CrsMatrix, or -/// KokkosSparse::Experimental::BsrMatrix \tparam XVector Type of x, must be a -/// rank-2 Kokkos::View \tparam BetaType Type of coefficient beta. Must be -/// convertible to YVector::value_type. \tparam YVector Type of y, must be a -/// rank-2 Kokkos::View and its rank must match that of XVector +/// YVector::value_type. +/// \tparam AMatrix A KokkosSparse::CrsMatrix, or +/// KokkosSparse::Experimental::BsrMatrix. Must be identical to Handle::AMatrixType. +/// \tparam XVector Type of x. Must be a rank-1 or 2 Kokkos::View and be identical to Handle::XVectorType. +/// \tparam BetaType Type of coefficient beta. Must be convertible to YVector::value_type. +/// \tparam YVector Type of y. Must have the same rank as XVector and be identical to Handle::YVectorType. /// +/// \param handle [in/out] a pointer to a KokkosSparse::SPMVHandle. On the first call to spmv with +/// a given handle instance, the handle's internal data will be initialized automatically. +/// On all later calls to spmv, this internal data will be reused. /// \param mode [in] Select A's operator mode: "N" for normal, "T" for -/// transpose, "C" for conjugate or "H" for conjugate transpose. \param alpha -/// [in] Scalar multiplier for the matrix A. \param A [in] The sparse matrix A. +/// transpose, "C" for conjugate or "H" for conjugate transpose. +/// \param alpha [in] Scalar multiplier for the matrix A. +/// \param A [in] The sparse matrix A. /// \param x [in] A vector to multiply on the left by A. /// \param beta [in] Scalar multiplier for the vector y. /// \param y [in/out] Result vector. 
-template -void spmv(const char mode[], const AlphaType& alpha, const AMatrix& A, - const XVector& x, const BetaType& beta, const YVector& y) { - KokkosKernels::Experimental::Controls controls; - spmv(controls, mode, alpha, A, x, beta, y); +// clang-format on +template < + class Handle, class AlphaType, class AMatrix, class XVector, class BetaType, + class YVector, + typename = std::enable_if_t::value>> +void spmv(Handle* handle, const char mode[], const AlphaType& alpha, + const AMatrix& A, const XVector& x, const BetaType& beta, + const YVector& y) { + spmv(typename Handle::ExecutionSpaceType(), handle, mode, alpha, A, x, beta, + y); } +// clang-format off /// \brief Kokkos sparse matrix-vector multiply. /// Computes y := alpha*Op(A)*x + beta*y, where Op(A) is controlled by mode /// (see below). /// -/// \tparam ExecutionSpace A Kokkos execution space. Must be able to access -/// the memory spaces of A, x, and y. -/// \tparam AlphaType Type of coefficient alpha. Must be convertible to -/// YVector::value_type. \tparam AMatrix A KokkosSparse::CrsMatrix, or -/// KokkosSparse::Experimental::BsrMatrix \tparam XVector Type of x, must be a -/// rank-2 Kokkos::View \tparam BetaType Type of coefficient beta. Must be -/// convertible to YVector::value_type. \tparam YVector Type of y, must be a -/// rank-2 Kokkos::View and its rank must match that of XVector +/// \tparam AlphaType Type of coefficient alpha. Must be convertible to YVector::value_type. +/// \tparam AMatrix A KokkosSparse::CrsMatrix, or KokkosSparse::Experimental::BsrMatrix +/// \tparam XVector Type of x, must be a rank-1 or rank-2 Kokkos::View +/// \tparam BetaType Type of coefficient beta. Must be convertible to YVector::value_type. +/// \tparam YVector Type of y, must be a Kokkos::View and its rank must match that of XVector /// -/// \param space [in] The execution space instance on which to run the -/// kernel. 
/// \param mode [in] Select A's operator mode: "N" for normal, "T" for -/// transpose, "C" for conjugate or "H" for conjugate transpose. \param alpha -/// [in] Scalar multiplier for the matrix A. \param A [in] The sparse matrix A. +/// transpose, "C" for conjugate or "H" for conjugate transpose. +/// \param alpha [in] Scalar multiplier for the matrix A. +/// \param A [in] The sparse matrix A. /// \param x [in] A vector to multiply on the left by A. /// \param beta [in] Scalar multiplier for the vector y. /// \param y [in/out] Result vector. -template -void spmv(const ExecutionSpace& space, const char mode[], - const AlphaType& alpha, const AMatrix& A, const XVector& x, - const BetaType& beta, const YVector& y) { - KokkosKernels::Experimental::Controls controls; - spmv(space, controls, mode, alpha, A, x, beta, y); +// clang-format on +template +void spmv(const char mode[], const AlphaType& alpha, const AMatrix& A, + const XVector& x, const BetaType& beta, const YVector& y) { + SPMVAlgorithm algo = SPMV_FAST_SETUP; + // Without handle reuse, native is overall faster than rocSPARSE +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + if constexpr (std::is_same_v) + algo = SPMV_NATIVE; +#endif + SPMVHandle + handle(algo); + spmv(typename AMatrix::execution_space(), &handle, mode, alpha, A, x, beta, + y); } namespace Experimental { @@ -1332,17 +553,17 @@ void spmv_struct(const ExecutionSpace& space, const char mode[], static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv_struct: AMatrix must be accessible from " + "KokkosSparse::spmv_struct: AMatrix must be accessible from " "ExecutionSpace"); static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv_struct: XVector must be accessible from " + "KokkosSparse::spmv_struct: XVector must be accessible from " "ExecutionSpace"); static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv_struct: YVector must be accessible from " + "KokkosSparse::spmv_struct: YVector must be 
accessible from " "ExecutionSpace"); // Make sure that x (and therefore y) is rank 1. static_assert( @@ -1391,13 +612,13 @@ void spmv_struct(const ExecutionSpace& space, const char mode[], typename XVector::const_value_type*, typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, typename XVector::device_type, - Kokkos::MemoryTraits > + Kokkos::MemoryTraits> XVector_Internal; typedef Kokkos::View< typename YVector::non_const_value_type*, typename KokkosKernels::Impl::GetUnifiedLayout::array_layout, - typename YVector::device_type, Kokkos::MemoryTraits > + typename YVector::device_type, Kokkos::MemoryTraits> YVector_Internal; AMatrix_Internal A_i = A; @@ -1627,25 +848,25 @@ void spmv_struct(const ExecutionSpace& space, const char mode[], static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv_struct: AMatrix must be accessible from " + "KokkosSparse::spmv_struct: AMatrix must be accessible from " "ExecutionSpace"); static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv_struct: XVector must be accessible from " + "KokkosSparse::spmv_struct: XVector must be accessible from " "ExecutionSpace"); static_assert( Kokkos::SpaceAccessibility::accessible, - "KokkosBlas::spmv_struct: YVector must be accessible from " + "KokkosSparse::spmv_struct: YVector must be accessible from " "ExecutionSpace"); // Make sure that both x and y have the same rank. static_assert(XVector::rank == YVector::rank, - "KokkosBlas::spmv: Vector ranks do not match."); + "KokkosSparse::spmv: Vector ranks do not match."); // Make sure that y is non-const. static_assert(std::is_same::value, - "KokkosBlas::spmv: Output Vector must be non-const."); + "KokkosSparse::spmv: Output Vector must be non-const."); // Check compatibility of dimensions at run time. 
if ((mode[0] == NoTranspose[0]) || (mode[0] == Conjugate[0])) { @@ -1653,7 +874,7 @@ void spmv_struct(const ExecutionSpace& space, const char mode[], (static_cast(A.numCols()) > static_cast(x.extent(0))) || (static_cast(A.numRows()) > static_cast(y.extent(0)))) { std::ostringstream os; - os << "KokkosBlas::spmv: Dimensions do not match: " + os << "KokkosSparse::spmv: Dimensions do not match: " << ", A: " << A.numRows() << " x " << A.numCols() << ", x: " << x.extent(0) << " x " << x.extent(1) << ", y: " << y.extent(0) << " x " << y.extent(1); @@ -1664,7 +885,7 @@ void spmv_struct(const ExecutionSpace& space, const char mode[], (static_cast(A.numCols()) > static_cast(y.extent(0))) || (static_cast(A.numRows()) > static_cast(x.extent(0)))) { std::ostringstream os; - os << "KokkosBlas::spmv: Dimensions do not match (transpose): " + os << "KokkosSparse::spmv: Dimensions do not match (transpose): " << ", A: " << A.numRows() << " x " << A.numCols() << ", x: " << x.extent(0) << " x " << x.extent(1) << ", y: " << y.extent(0) << " x " << y.extent(1); @@ -1685,11 +906,11 @@ void spmv_struct(const ExecutionSpace& space, const char mode[], typedef Kokkos::View< typename XVector::const_value_type*, typename YVector::array_layout, typename XVector::device_type, - Kokkos::MemoryTraits > + Kokkos::MemoryTraits> XVector_SubInternal; typedef Kokkos::View< typename YVector::non_const_value_type*, typename YVector::array_layout, - typename YVector::device_type, Kokkos::MemoryTraits > + typename YVector::device_type, Kokkos::MemoryTraits> YVector_SubInternal; XVector_SubInternal x_i = Kokkos::subview(x, Kokkos::ALL(), 0); @@ -1706,28 +927,7 @@ void spmv_struct(const ExecutionSpace& space, const char mode[], } // Call true rank 2 vector implementation - { - typedef Kokkos::View< - typename XVector::const_value_type**, typename XVector::array_layout, - typename XVector::device_type, - Kokkos::MemoryTraits > - XVector_Internal; - - typedef Kokkos::View > - YVector_Internal; - - 
XVector_Internal x_i = x; - YVector_Internal y_i = y; - - return KokkosSparse::Impl::SPMV_MV< - ExecutionSpace, AMatrix_Internal, XVector_Internal, - YVector_Internal>::spmv_mv(space, - KokkosKernels::Experimental::Controls(), - mode, alpha, A_i, x_i, beta, y_i); - } + spmv(space, mode, alpha, A, x, beta, y); } template +struct SPMV2D1D { + static bool spmv2d1d(const char mode[], const AlphaType& alpha, + const AMatrix& A, const XVector& x, const BetaType& beta, + const YVector& y); + + template + static bool spmv2d1d(const ExecutionSpace& space, const char mode[], + const AlphaType& alpha, const AMatrix& A, + const XVector& x, const BetaType& beta, + const YVector& y); +}; + +#if defined(KOKKOSKERNELS_INST_LAYOUTSTRIDE) || !defined(KOKKOSKERNELS_ETI_ONLY) +template +struct SPMV2D1D { + static bool spmv2d1d(const char mode[], const AlphaType& alpha, + const AMatrix& A, const XVector& x, const BetaType& beta, + const YVector& y) { + spmv(typename AMatrix::execution_space{}, mode, alpha, A, x, beta, y); + return true; + } + + template + static bool spmv2d1d(const ExecutionSpace& space, const char mode[], + const AlphaType& alpha, const AMatrix& A, + const XVector& x, const BetaType& beta, + const YVector& y) { + spmv(space, mode, alpha, A, x, beta, y); + return true; + } +}; + +#else + +template +struct SPMV2D1D { + static bool spmv2d1d(const char /*mode*/[], const AlphaType& /*alpha*/, + const AMatrix& /*A*/, const XVector& /*x*/, + const BetaType& /*beta*/, const YVector& /*y*/) { + return false; + } + + template + static bool spmv2d1d(const ExecutionSpace& /* space */, const char /*mode*/[], + const AlphaType& /*alpha*/, const AMatrix& /*A*/, + const XVector& /*x*/, const BetaType& /*beta*/, + const YVector& /*y*/) { + return false; + } +}; +#endif + +#if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || !defined(KOKKOSKERNELS_ETI_ONLY) +template +struct SPMV2D1D { + static bool spmv2d1d(const char mode[], const AlphaType& alpha, + const AMatrix& A, const XVector& x, 
const BetaType& beta, + const YVector& y) { + spmv(typename AMatrix::execution_space{}, mode, alpha, A, x, beta, y); + return true; + } + + template + static bool spmv2d1d(const ExecutionSpace& space, const char mode[], + const AlphaType& alpha, const AMatrix& A, + const XVector& x, const BetaType& beta, + const YVector& y) { + spmv(space, mode, alpha, A, x, beta, y); + return true; + } +}; + +#else + +template +struct SPMV2D1D { + static bool spmv2d1d(const char /*mode*/[], const AlphaType& /*alpha*/, + const AMatrix& /*A*/, const XVector& /*x*/, + const BetaType& /*beta*/, const YVector& /*y*/) { + return false; + } + + template + static bool spmv2d1d(const ExecutionSpace& /* space */, const char /*mode*/[], + const AlphaType& /*alpha*/, const AMatrix& /*A*/, + const XVector& /*x*/, const BetaType& /*beta*/, + const YVector& /*y*/) { + return false; + } +}; +#endif + +#if defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) || !defined(KOKKOSKERNELS_ETI_ONLY) +template +struct SPMV2D1D { + static bool spmv2d1d(const char mode[], const AlphaType& alpha, + const AMatrix& A, const XVector& x, const BetaType& beta, + const YVector& y) { + spmv(typename AMatrix::execution_space{}, mode, alpha, A, x, beta, y); + return true; + } + + template + static bool spmv2d1d(const ExecutionSpace& space, const char mode[], + const AlphaType& alpha, const AMatrix& A, + const XVector& x, const BetaType& beta, + const YVector& y) { + spmv(space, mode, alpha, A, x, beta, y); + return true; + } +}; + +#else + +template +struct SPMV2D1D { + static bool spmv2d1d(const char /*mode*/[], const AlphaType& /*alpha*/, + const AMatrix& /*A*/, const XVector& /*x*/, + const BetaType& /*beta*/, const YVector& /*y*/) { + return false; + } + + template + static bool spmv2d1d(const ExecutionSpace& /* space */, const char /*mode*/[], + const AlphaType& /*alpha*/, const AMatrix& /*A*/, + const XVector& /*x*/, const BetaType& /*beta*/, + const YVector& /*y*/) { + return false; + } +}; +#endif +} // namespace Impl + 
+template +using SPMV2D1D + [[deprecated("KokkosSparse::SPMV2D1D is not part of the public interface - " + "use KokkosSparse::spmv instead")]] = + Impl::SPMV2D1D; + +template +[ + [deprecated("Use the version of spmv that takes a SPMVHandle instead of " + "Controls")]] void +spmv(const ExecutionSpace& space, + KokkosKernels::Experimental::Controls controls, const char mode[], + const AlphaType& alpha, const AMatrix& A, const XVector& x, + const BetaType& beta, const YVector& y) { + // Default to fast setup, since this handle can't be reused + SPMVAlgorithm algo = SPMV_FAST_SETUP; + // Translate the Controls algorithm selection to the SPMVHandle algorithm. + // This maintains the old behavior, where any manually set name that isn't + // "tpl" gives native. + // + // This also uses the behavior set by #2021: "merge" was a hint to use + // cuSPARSE merge path, but that path is gone so just use the normal TPL. + // "native-merge" means to use the KK merge-path implementation. + // + // And also support the 3 different BSR algorithms by their old names. 
+ if (controls.isParameter("algorithm")) { + std::string algoName = controls.getParameter("algorithm"); + if (algoName == "merge" || algoName == "tpl") + algo = SPMV_FAST_SETUP; + else if (algoName == "native-merge") + algo = SPMV_MERGE_PATH; + else if (algoName == "v4.1") + algo = SPMV_BSR_V41; + else if (algoName == "v4.2") + algo = SPMV_BSR_V42; + else if (algoName == "experimental_bsr_tc" || algoName == "experimental_tc") + algo = SPMV_BSR_TC; + else + throw std::invalid_argument( + std::string("KokkosSparse::spmv: controls algorithm name '") + + algoName + "' is not supported.\n"); + } + KokkosSparse::SPMVHandle handle( + algo); + // Pull out any expert tuning parameters + if (controls.isParameter("schedule")) { + if (controls.getParameter("schedule") == "dynamic") { + handle.force_dynamic_schedule = true; + } else if (controls.getParameter("schedule") == "static") { + handle.force_static_schedule = true; + } + } + if (controls.isParameter("team size")) + handle.team_size = std::stoi(controls.getParameter("team size")); + if (controls.isParameter("vector length")) + handle.vector_length = std::stoi(controls.getParameter("vector length")); + if (controls.isParameter("rows per thread")) + handle.rows_per_thread = + std::stoll(controls.getParameter("rows per thread")); + spmv(space, &handle, mode, alpha, A, x, beta, y); +} + +template +[ + [deprecated("Use the version of spmv that takes a SPMVHandle instead of " + "Controls")]] void +spmv(KokkosKernels::Experimental::Controls controls, const char mode[], + const AlphaType& alpha, const AMatrix& A, const XVector& x, + const BetaType& beta, const YVector& y) { + spmv(typename AMatrix::execution_space{}, controls, mode, alpha, A, x, beta, + y); +} + +template +[ + [deprecated("Use the version of spmv that takes a SPMVHandle instead of " + "Controls")]] void +spmv(const ExecutionSpace& space, + KokkosKernels::Experimental::Controls controls, const char mode[], + const AlphaType& alpha, const AMatrix& A, const 
XVector& x, + const BetaType& beta, const YVector& y, const RANK_ONE&) { + spmv(space, controls, mode, alpha, A, x, beta, y); +} + +template +[ + [deprecated("Use the version of spmv that takes a SPMVHandle instead of " + "Controls")]] void +spmv(KokkosKernels::Experimental::Controls controls, const char mode[], + const AlphaType& alpha, const AMatrix& A, const XVector& x, + const BetaType& beta, const YVector& y, const RANK_ONE&) { + spmv(controls, mode, alpha, A, x, beta, y); +} + +template +[ + [deprecated("Use the version of spmv that takes a SPMVHandle instead of " + "Controls")]] void +spmv(const ExecutionSpace& space, + KokkosKernels::Experimental::Controls controls, const char mode[], + const AlphaType& alpha, const AMatrix& A, const XVector& x, + const BetaType& beta, const YVector& y, const RANK_TWO&) { + spmv(space, controls, mode, alpha, A, x, beta, y); +} + +template +[ + [deprecated("Use the version of spmv that takes a SPMVHandle instead of " + "Controls")]] void +spmv(KokkosKernels::Experimental::Controls controls, const char mode[], + const AlphaType& alpha, const AMatrix& A, const XVector& x, + const BetaType& beta, const YVector& y, const RANK_TWO&) { + spmv(controls, mode, alpha, A, x, beta, y); +} + +} // namespace KokkosSparse + +#endif diff --git a/sparse/src/KokkosSparse_spmv_handle.hpp b/sparse/src/KokkosSparse_spmv_handle.hpp new file mode 100644 index 0000000000..9e7295c72c --- /dev/null +++ b/sparse/src/KokkosSparse_spmv_handle.hpp @@ -0,0 +1,389 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSSPARSE_SPMV_HANDLE_HPP_ +#define KOKKOSSPARSE_SPMV_HANDLE_HPP_ + +#include +#include "KokkosSparse_CrsMatrix.hpp" +#include "KokkosSparse_BsrMatrix.hpp" +// Use TPL utilities for safely finalizing matrix descriptors, etc. +#include "KokkosSparse_Utils_cusparse.hpp" +#include "KokkosSparse_Utils_rocsparse.hpp" +#include "KokkosSparse_Utils_mkl.hpp" + +namespace KokkosSparse { + +/// SPMVAlgorithm values can be used to select different algorithms/methods for +/// performing SpMV computations. +enum SPMVAlgorithm { + SPMV_DEFAULT, /// Default algorithm: best overall performance for repeated + /// applications of SpMV. + SPMV_FAST_SETUP, /// Best performance in the non-reuse case, where the handle + /// is only used once. + SPMV_NATIVE, /// Use the best KokkosKernels implementation, even if a TPL + /// implementation is available. + SPMV_MERGE_PATH, /// Use load-balancing merge path algorithm (for CrsMatrix + /// only) + SPMV_BSR_V41, /// Use experimental version 4.1 algorithm (for BsrMatrix only) + SPMV_BSR_V42, /// Use experimental version 4.2 algorithm (for BsrMatrix only) + SPMV_BSR_TC /// Use experimental tensor core algorithm (for BsrMatrix only) +}; + +namespace Experimental { +/// Precision to use in the tensor core implementation of Bsr SpMV +enum class Bsr_TC_Precision { + Automatic, ///< Use Double, unless operations match mixed precision + Double, ///< fp64 += fp64 * fp64 + Mixed ///< fp32 += fp16 * fp16 +}; +} // namespace Experimental + +/// Get the name of a SPMVAlgorithm enum constant +inline const char* get_spmv_algorithm_name(SPMVAlgorithm a) { + switch (a) { + case SPMV_DEFAULT: return "SPMV_DEFAULT"; + case SPMV_FAST_SETUP: return "SPMV_FAST_SETUP"; + case SPMV_NATIVE: return "SPMV_NATIVE"; + case SPMV_MERGE_PATH: return "SPMV_MERGE_PATH"; + case SPMV_BSR_V41: return "SPMV_BSR_V41"; + case SPMV_BSR_V42: return "SPMV_BSR_V42"; + case SPMV_BSR_TC: return 
"SPMV_BSR_TC"; + } + throw std::invalid_argument( + "SPMVHandle::get_algorithm_name: unknown algorithm"); + return ""; +} + +/// Return true if the given algorithm is always a native (KokkosKernels) +/// implementation, and false if it may be implemented by a TPL. +inline bool is_spmv_algorithm_native(SPMVAlgorithm a) { + switch (a) { + case SPMV_NATIVE: + case SPMV_MERGE_PATH: + case SPMV_BSR_V41: + case SPMV_BSR_V42: + case SPMV_BSR_TC: return true; + default: return false; + } +} + +namespace Impl { + +template +struct TPL_SpMV_Data { + // Disallow default construction: must provide the initial execution space + TPL_SpMV_Data() = delete; + TPL_SpMV_Data(const ExecutionSpace& exec_) : exec(exec_) {} + void set_exec_space(const ExecutionSpace& new_exec) { + // Check if new_exec is different from (old) exec. + // If it is, fence the old exec now. + // That way, SPMVHandle cleanup doesn't need + // to worry about resources still being in use on the old exec. + if (exec != new_exec) { + exec.fence(); + exec = new_exec; + } + } + virtual ~TPL_SpMV_Data() {} + ExecutionSpace exec; +}; + +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE +#if defined(CUSPARSE_VERSION) && (10300 <= CUSPARSE_VERSION) +// Data used by cuSPARSE >=10.3 for both single-vector (SpMV) and multi-vector +// (SpMM). 
+// TODO: in future, this can also be used for BSR (cuSPARSE >=12.2) +struct CuSparse10_SpMV_Data : public TPL_SpMV_Data { + CuSparse10_SpMV_Data(const Kokkos::Cuda& exec_) : TPL_SpMV_Data(exec_) {} + ~CuSparse10_SpMV_Data() { + // Prefer cudaFreeAsync on the stream that last executed a spmv, but + // async memory management was introduced in 11.2 +#if (CUDA_VERSION >= 11020) + KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFreeAsync(buffer, exec.cuda_stream())); +#else + // Fence here to ensure spmv is not still using buffer + // (cudaFree does not do a device synchronize) + exec.fence(); + KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFree(buffer)); +#endif + KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroySpMat(mat)); + } + + cusparseSpMatDescr_t mat; + size_t bufferSize = 0; + void* buffer = nullptr; +}; +#endif + +// Data used by cuSPARSE <10.3 for CRS, and >=9 for BSR +struct CuSparse9_SpMV_Data : public TPL_SpMV_Data { + CuSparse9_SpMV_Data(const Kokkos::Cuda& exec_) : TPL_SpMV_Data(exec_) {} + ~CuSparse9_SpMV_Data() { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroyMatDescr(mat)); + } + + cusparseMatDescr_t mat; +}; +#endif + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE +struct RocSparse_CRS_SpMV_Data : public TPL_SpMV_Data { + RocSparse_CRS_SpMV_Data(const Kokkos::HIP& exec_) : TPL_SpMV_Data(exec_) {} + ~RocSparse_CRS_SpMV_Data() { + // note: hipFree includes an implicit device synchronize + KOKKOS_IMPL_HIP_SAFE_CALL(hipFree(buffer)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_destroy_spmat_descr(mat)); + } + + rocsparse_spmat_descr mat; + size_t bufferSize = 0; + void* buffer = nullptr; +}; + +struct RocSparse_BSR_SpMV_Data : public TPL_SpMV_Data { + RocSparse_BSR_SpMV_Data(const Kokkos::HIP& exec_) : TPL_SpMV_Data(exec_) {} + ~RocSparse_BSR_SpMV_Data() { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_destroy_mat_descr(mat)); +#if (KOKKOSSPARSE_IMPL_ROCM_VERSION >= 50400) + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_destroy_mat_info(info)); +#endif + } + + rocsparse_mat_descr mat; +#if 
(KOKKOSSPARSE_IMPL_ROCM_VERSION >= 50400) + rocsparse_mat_info info; +#endif +}; +#endif + +// note: header defining __INTEL_MKL__ is pulled in above by Utils_mkl.hpp +#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL + +#if (__INTEL_MKL__ > 2017) +// Data for classic MKL (both CRS and BSR) +template +struct MKL_SpMV_Data : public TPL_SpMV_Data { + MKL_SpMV_Data(const ExecutionSpace& exec_) + : TPL_SpMV_Data(exec_) {} + ~MKL_SpMV_Data() { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_destroy(mat)); + // descr is just a plain-old-data struct, no cleanup to do + } + + sparse_matrix_t mat; + matrix_descr descr; +}; +#endif + +#if defined(KOKKOS_ENABLE_SYCL) && \ + !defined(KOKKOSKERNELS_ENABLE_TPL_MKL_SYCL_OVERRIDE) +struct OneMKL_SpMV_Data : public TPL_SpMV_Data { + OneMKL_SpMV_Data(const Kokkos::Experimental::SYCL& exec_) + : TPL_SpMV_Data(exec_) {} + ~OneMKL_SpMV_Data() { + // Make sure no spmv is still running with this handle, if exec uses an + // out-of-order queue (rare case) + if (!exec.sycl_queue().is_in_order()) exec.fence(); +#if INTEL_MKL_VERSION >= 20230200 + // MKL 2023.2 and up make this async release okay even though it takes a + // pointer to mat, which is going out of scope after this destructor + oneapi::mkl::sparse::release_matrix_handle(exec.sycl_queue(), &mat); +#else + // But in older versions, wait on ev_release before letting mat go out of + // scope + auto ev_release = + oneapi::mkl::sparse::release_matrix_handle(exec.sycl_queue(), &mat); + ev_release.wait(); +#endif + } + + oneapi::mkl::sparse::matrix_handle_t mat; +}; +#endif +#endif + +template +struct SPMVHandleImpl { + using ExecutionSpaceType = ExecutionSpace; + // This is its own ImplType + using ImplType = + SPMVHandleImpl; + // Do not allow const qualifier on Scalar, Ordinal, Offset (otherwise this + // type won't match the ETI'd type). Users should not use SPMVHandleImpl + // directly and SPMVHandle explicitly removes const, so this should never + // happen in practice. 
+ static_assert(!std::is_const_v, + "SPMVHandleImpl: Scalar must not be a const type"); + static_assert(!std::is_const_v, + "SPMVHandleImpl: Offset must not be a const type"); + static_assert(!std::is_const_v, + "SPMVHandleImpl: Ordinal must not be a const type"); + SPMVHandleImpl(SPMVAlgorithm algo_) : algo(algo_) {} + ~SPMVHandleImpl() { + if (tpl) delete tpl; + } + void set_exec_space(const ExecutionSpace& exec) { + if (tpl) tpl->set_exec_space(exec); + } + + /// Get the SPMVAlgorithm used by this handle + SPMVAlgorithm get_algorithm() const { return this->algo; } + + bool is_set_up = false; + const SPMVAlgorithm algo = SPMV_DEFAULT; + TPL_SpMV_Data* tpl = nullptr; + // Expert tuning parameters for native SpMV + // TODO: expose a proper Experimental interface to set these. Currently they + // can be assigned directly in the SPMVHandle as they are public members. + int team_size = -1; + int vector_length = -1; + int64_t rows_per_thread = -1; + bool force_static_schedule = false; + bool force_dynamic_schedule = false; + KokkosSparse::Experimental::Bsr_TC_Precision bsr_tc_precision = + KokkosSparse::Experimental::Bsr_TC_Precision::Automatic; +}; +} // namespace Impl + +// clang-format off +/// \class SPMVHandle +/// \brief Opaque handle type for KokkosSparse::spmv. It passes the choice of +/// algorithm to the spmv implementation, and also may store internal data which can be used to +/// speed up the spmv computation. +/// \tparam DeviceType A Kokkos::Device or execution space where the spmv computation will be run. +/// Does not necessarily need to match AMatrix's device type, but its execution space needs to be able +/// to access the memory spaces of AMatrix, XVector and YVector. +/// \tparam AMatrix A specialization of KokkosSparse::CrsMatrix or +/// KokkosSparse::BsrMatrix. +/// +/// SPMVHandle's internal resources are lazily allocated and initialized by the first +/// spmv call. 
+/// +/// SPMVHandle automatically cleans up all allocated resources when it is destructed. +/// No fencing by the user is required between the final spmv and cleanup. +/// +/// A SPMVHandle instance can be used in any number of calls, with any execution space +/// instance and any X/Y vectors (with matching types) each call. +/// +/// \warning However, all calls to spmv with a given instance of SPMVHandle must use the +/// same matrix. +// clang-format on + +template +struct SPMVHandle + : public Impl::SPMVHandleImpl { + using ImplType = + Impl::SPMVHandleImpl; + // Note: these typedef names cannot shadow template parameters + using AMatrixType = AMatrix; + using XVectorType = XVector; + using YVectorType = YVector; + using ExecutionSpaceType = typename DeviceType::execution_space; + // Check all template parameters for compatibility with each other + // NOTE: we do not require that ExecutionSpace matches + // AMatrix::execution_space. For example, if the matrix's device is it is allowed to run spmv on Serial. 
+ static_assert(is_crs_matrix_v || + Experimental::is_bsr_matrix_v, + "SPMVHandle: AMatrix must be a specialization of CrsMatrix or " + "BsrMatrix."); + static_assert(Kokkos::is_view::value, + "SPMVHandle: XVector must be a Kokkos::View."); + static_assert(Kokkos::is_view::value, + "SPMVHandle: YVector must be a Kokkos::View."); + static_assert(XVector::rank() == YVector::rank(), + "SPMVHandle: ranks of XVector and YVector must match."); + static_assert( + XVector::rank() == size_t(1) || YVector::rank() == size_t(2), + "SPMVHandle: XVector and YVector must be both rank-1 or both rank-2."); + static_assert( + Kokkos::SpaceAccessibility::accessible, + "SPMVHandle: AMatrix must be accessible from ExecutionSpace"); + static_assert( + Kokkos::SpaceAccessibility::accessible, + "SPMVHandle: XVector must be accessible from ExecutionSpace"); + static_assert( + Kokkos::SpaceAccessibility::accessible, + "SPMVHandle: YVector must be accessible from ExecutionSpace"); + + // Prevent copying (this object does not support reference counting) + SPMVHandle(const SPMVHandle&) = delete; + SPMVHandle& operator=(const SPMVHandle&) = delete; + + /// \brief Create a new SPMVHandle using the given algorithm. 
+ SPMVHandle(SPMVAlgorithm algo_ = SPMV_DEFAULT) : ImplType(algo_) { + // Validate the choice of algorithm based on A's type + if constexpr (is_crs_matrix_v) { + switch (get_algorithm()) { + case SPMV_BSR_V41: + case SPMV_BSR_V42: + case SPMV_BSR_TC: + throw std::invalid_argument(std::string("SPMVHandle: algorithm ") + + get_spmv_algorithm_name(get_algorithm()) + + " cannot be used if A is a CrsMatrix"); + default:; + } + } else { + switch (get_algorithm()) { + case SPMV_MERGE_PATH: + throw std::invalid_argument(std::string("SPMVHandle: algorithm ") + + get_spmv_algorithm_name(get_algorithm()) + + " cannot be used if A is a BsrMatrix"); + default:; + } + } + } + + /// Get the SPMVAlgorithm used by this handle + SPMVAlgorithm get_algorithm() const { + // Note: get_algorithm is also a method of parent ImplType, but for + // documentation purposes it should appear directly in the public interface + // of SPMVHandle + return this->algo; + } + + /// Get pointer to this as the impl type + ImplType* get_impl() { return static_cast(this); } +}; + +namespace Impl { +template +struct is_spmv_handle : public std::false_type {}; +template +struct is_spmv_handle> : public std::true_type {}; +template +struct is_spmv_handle> : public std::true_type {}; + +template +inline constexpr bool is_spmv_handle_v = is_spmv_handle::value; +} // namespace Impl + +} // namespace KokkosSparse + +#endif diff --git a/sparse/src/KokkosSparse_sptrsv.hpp b/sparse/src/KokkosSparse_sptrsv.hpp index 859918c58d..1fef3e9f1b 100644 --- a/sparse/src/KokkosSparse_sptrsv.hpp +++ b/sparse/src/KokkosSparse_sptrsv.hpp @@ -40,10 +40,23 @@ namespace Experimental { std::is_same::type, \ typename std::remove_const::type>::value -template -void sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, - lno_nnz_view_t_ entries) { +/** + * @brief sptrsv symbolic phase for linear system Ax=b + * + * @tparam ExecutionSpace This kernels execution space type + * @tparam KernelHandle A specialization of + * 
KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam lno_row_view_t_ The CRS matrix's (A) rowmap type + * @tparam lno_nnz_view_t_ The CRS matrix's (A) entries type + * @param space The execution space instance this kernel will run on + * @param handle KernelHandle instance + * @param rowmap The CRS matrix's (A) rowmap + * @param entries The CRS matrix's (A) entries + */ +template +void sptrsv_symbolic(const ExecutionSpace &space, KernelHandle *handle, + lno_row_view_t_ rowmap, lno_nnz_view_t_ entries) { typedef typename KernelHandle::size_type size_type; typedef typename KernelHandle::nnz_lno_t ordinal_type; @@ -94,8 +107,9 @@ void sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, Entries_Internal entries_i = entries; KokkosSparse::Impl::SPTRSV_SYMBOLIC< - const_handle_type, RowMap_Internal, - Entries_Internal>::sptrsv_symbolic(&tmp_handle, rowmap_i, entries_i); + ExecutionSpace, const_handle_type, RowMap_Internal, + Entries_Internal>::sptrsv_symbolic(space, &tmp_handle, rowmap_i, + entries_i); #ifdef KK_TRISOLVE_TIMERS std::cout << " > sptrsv_symbolic time = " << timer_sptrsv.seconds() @@ -103,14 +117,54 @@ void sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, #endif } // sptrsv_symbolic +/** + * @brief sptrsv symbolic phase for linear system Ax=b + * + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam lno_row_view_t_ The CRS matrix's (A) rowmap type + * @tparam lno_nnz_view_t_ The CRS matrix's (A) entries type + * @param handle KernelHandle instance + * @param rowmap The CRS matrix's (A) rowmap + * @param entries The CRS matrix's (A) entries + */ template + typename lno_nnz_view_t_> void sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, - lno_nnz_view_t_ entries, scalar_nnz_view_t_ values) { + lno_nnz_view_t_ entries) { + using ExecutionSpace = typename KernelHandle::HandleExecSpace; + auto my_exec_space = ExecutionSpace(); + sptrsv_symbolic(my_exec_space, 
handle, rowmap, entries); +} + +/** + * @brief sptrsv symbolic phase for linear system Ax=b + * + * @tparam ExecutionSpace This kernels execution space type + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam lno_row_view_t_ The CRS matrix's (A) rowmap type + * @tparam lno_nnz_view_t_ The CRS matrix's (A) entries type + * @param space The execution space instance this kernel will run on + * @param handle KernelHandle instance + * @param rowmap The CRS matrix's (A) rowmap + * @param entries The CRS matrix's (A) entries + * @param values The CRS matrix's (A) values + */ +template +void sptrsv_symbolic(ExecutionSpace &space, KernelHandle *handle, + lno_row_view_t_ rowmap, lno_nnz_view_t_ entries, + scalar_nnz_view_t_ values) { typedef typename KernelHandle::size_type size_type; typedef typename KernelHandle::nnz_lno_t ordinal_type; typedef typename KernelHandle::nnz_scalar_t scalar_type; + static_assert( + std::is_same_v, + "sptrsv_symbolic: ExecutionSpace and HandleExecSpace need to match!"); + static_assert(KOKKOSKERNELS_SPTRSV_SAME_TYPE( typename lno_row_view_t_::non_const_value_type, size_type), "sptrsv_symbolic: A size_type must match KernelHandle " @@ -140,50 +194,60 @@ void sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, const_handle_type; const_handle_type tmp_handle(*handle); - typedef Kokkos::View< - typename lno_row_view_t_::const_value_type *, - typename KokkosKernels::Impl::GetUnifiedLayout< - lno_row_view_t_>::array_layout, - typename lno_row_view_t_::device_type, - Kokkos::MemoryTraits > - RowMap_Internal; - - typedef Kokkos::View< - typename lno_nnz_view_t_::const_value_type *, - typename KokkosKernels::Impl::GetUnifiedLayout< - lno_nnz_view_t_>::array_layout, - typename lno_nnz_view_t_::device_type, - Kokkos::MemoryTraits > - Entries_Internal; - - typedef Kokkos::View< - typename scalar_nnz_view_t_::const_value_type *, - typename KokkosKernels::Impl::GetUnifiedLayout< - 
scalar_nnz_view_t_>::array_layout, - typename scalar_nnz_view_t_::device_type, - Kokkos::MemoryTraits > - Values_Internal; - #ifdef KK_TRISOLVE_TIMERS Kokkos::Timer timer_sptrsv; #endif auto sptrsv_handle = handle->get_sptrsv_handle(); if (sptrsv_handle->get_algorithm() == KokkosSparse::Experimental::SPTRSVAlgorithm::SPTRSV_CUSPARSE) { - RowMap_Internal rowmap_i = rowmap; - Entries_Internal entries_i = entries; - Values_Internal values_i = values; - - typedef typename KernelHandle::SPTRSVHandleType sptrsvHandleType; - sptrsvHandleType *sh = handle->get_sptrsv_handle(); - auto nrows = sh->get_nrows(); - - KokkosSparse::Impl::sptrsvcuSPARSE_symbolic< - sptrsvHandleType, RowMap_Internal, Entries_Internal, Values_Internal>( - sh, nrows, rowmap_i, entries_i, values_i, false); - +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + if constexpr (std::is_same_v) { + using RowMap_Internal = Kokkos::View< + typename lno_row_view_t_::const_value_type *, + typename KokkosKernels::Impl::GetUnifiedLayout< + lno_row_view_t_>::array_layout, + typename lno_row_view_t_::device_type, + Kokkos::MemoryTraits >; + + using Entries_Internal = Kokkos::View< + typename lno_nnz_view_t_::const_value_type *, + typename KokkosKernels::Impl::GetUnifiedLayout< + lno_nnz_view_t_>::array_layout, + typename lno_nnz_view_t_::device_type, + Kokkos::MemoryTraits >; + + using Values_Internal = Kokkos::View< + typename scalar_nnz_view_t_::const_value_type *, + typename KokkosKernels::Impl::GetUnifiedLayout< + scalar_nnz_view_t_>::array_layout, + typename scalar_nnz_view_t_::device_type, + Kokkos::MemoryTraits >; + + RowMap_Internal rowmap_i = rowmap; + Entries_Internal entries_i = entries; + Values_Internal values_i = values; + + typedef typename KernelHandle::SPTRSVHandleType sptrsvHandleType; + sptrsvHandleType *sh = handle->get_sptrsv_handle(); + auto nrows = sh->get_nrows(); + + KokkosSparse::Impl::sptrsvcuSPARSE_symbolic< + ExecutionSpace, sptrsvHandleType, RowMap_Internal, Entries_Internal, + 
Values_Internal>(space, sh, nrows, rowmap_i, entries_i, values_i, + false); + } else { + (void)values; + KokkosSparse::Experimental::sptrsv_symbolic(space, handle, rowmap, + entries); + } + +#else // We better go to the native implementation + (void)values; + KokkosSparse::Experimental::sptrsv_symbolic(space, handle, rowmap, entries); +#endif } else { - KokkosSparse::Experimental::sptrsv_symbolic(handle, rowmap, entries); + (void)values; + KokkosSparse::Experimental::sptrsv_symbolic(space, handle, rowmap, entries); } #ifdef KK_TRISOLVE_TIMERS std::cout << " + sptrsv_symbolic time = " << timer_sptrsv.seconds() @@ -191,16 +255,61 @@ void sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, #endif } // sptrsv_symbolic +/** + * @brief sptrsv symbolic phase for linear system Ax=b + * + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam lno_row_view_t_ The CRS matrix's (A) rowmap type + * @tparam lno_nnz_view_t_ The CRS matrix's (A) entries type + * @param handle KernelHandle instance + * @param rowmap The CRS matrix's (A) rowmap + * @param entries The CRS matrix's (A) entries + * @param values The CRS matrix's (A) values + */ template -void sptrsv_solve(KernelHandle *handle, lno_row_view_t_ rowmap, - lno_nnz_view_t_ entries, scalar_nnz_view_t_ values, BType b, - XType x) { + typename lno_nnz_view_t_, typename scalar_nnz_view_t_> +void sptrsv_symbolic(KernelHandle *handle, lno_row_view_t_ rowmap, + lno_nnz_view_t_ entries, scalar_nnz_view_t_ values) { + using ExecutionSpace = typename KernelHandle::HandleExecSpace; + auto my_exec_space = ExecutionSpace(); + + sptrsv_symbolic(my_exec_space, handle, rowmap, entries, values); +} + +/** + * @brief sptrsv solve phase of x for linear system Ax=b + * + * @tparam ExecutionSpace This kernels execution space + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam lno_row_view_t_ The CRS matrix's (A) rowmap type + 
* @tparam lno_nnz_view_t_ The CRS matrix's (A) entries type + * @tparam scalar_nnz_view_t_ The CRS matrix's (A) values type + * @tparam BType The b vector type + * @tparam XType The x vector type + * @param space The execution space instance this kernel will be run on + * @param handle KernelHandle instance + * @param rowmap The CRS matrix's (A) rowmap + * @param entries The CRS matrix's (A) entries + * @param values The CRS matrix's (A) values + * @param b The b vector + * @param x The x vector + */ +template +void sptrsv_solve(ExecutionSpace &space, KernelHandle *handle, + lno_row_view_t_ rowmap, lno_nnz_view_t_ entries, + scalar_nnz_view_t_ values, BType b, XType x) { typedef typename KernelHandle::size_type size_type; typedef typename KernelHandle::nnz_lno_t ordinal_type; typedef typename KernelHandle::nnz_scalar_t scalar_type; + static_assert( + std::is_same_v, + "sptrsv solve: ExecutionSpace and HandleExecSpace need to match"); + static_assert(KOKKOSKERNELS_SPTRSV_SAME_TYPE( typename lno_row_view_t_::non_const_value_type, size_type), "sptrsv_solve: A size_type must match KernelHandle size_type " @@ -301,29 +410,84 @@ void sptrsv_solve(KernelHandle *handle, lno_row_view_t_ rowmap, auto sptrsv_handle = handle->get_sptrsv_handle(); if (sptrsv_handle->get_algorithm() == KokkosSparse::Experimental::SPTRSVAlgorithm::SPTRSV_CUSPARSE) { - typedef typename KernelHandle::SPTRSVHandleType sptrsvHandleType; - sptrsvHandleType *sh = handle->get_sptrsv_handle(); - auto nrows = sh->get_nrows(); - - KokkosSparse::Impl::sptrsvcuSPARSE_solve( - sh, nrows, rowmap_i, entries_i, values_i, b_i, x_i, false); - +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + if constexpr (std::is_same_v) { + typedef typename KernelHandle::SPTRSVHandleType sptrsvHandleType; + sptrsvHandleType *sh = handle->get_sptrsv_handle(); + auto nrows = sh->get_nrows(); + + KokkosSparse::Impl::sptrsvcuSPARSE_solve< + ExecutionSpace, sptrsvHandleType, RowMap_Internal, Entries_Internal, + Values_Internal, 
BType_Internal, XType_Internal>( + space, sh, nrows, rowmap_i, entries_i, values_i, b_i, x_i, false); + } else { + KokkosSparse::Impl::SPTRSV_SOLVE< + ExecutionSpace, const_handle_type, RowMap_Internal, Entries_Internal, + Values_Internal, BType_Internal, + XType_Internal>::sptrsv_solve(space, &tmp_handle, rowmap_i, entries_i, + values_i, b_i, x_i); + } +#else + KokkosSparse::Impl::SPTRSV_SOLVE< + ExecutionSpace, const_handle_type, RowMap_Internal, Entries_Internal, + Values_Internal, BType_Internal, + XType_Internal>::sptrsv_solve(space, &tmp_handle, rowmap_i, entries_i, + values_i, b_i, x_i); +#endif } else { KokkosSparse::Impl::SPTRSV_SOLVE< - typename scalar_nnz_view_t_::execution_space, const_handle_type, - RowMap_Internal, Entries_Internal, Values_Internal, BType_Internal, - XType_Internal>::sptrsv_solve(&tmp_handle, rowmap_i, entries_i, + ExecutionSpace, const_handle_type, RowMap_Internal, Entries_Internal, + Values_Internal, BType_Internal, + XType_Internal>::sptrsv_solve(space, &tmp_handle, rowmap_i, entries_i, values_i, b_i, x_i); } } // sptrsv_solve -#if defined(KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV) -// --------------------------------------------------------------------- -template -void sptrsv_solve(KernelHandle *handle, XType x, XType b) { +/** + * @brief sptrsv solve phase of x for linear system Ax=b + * + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam lno_row_view_t_ The CRS matrix's (A) rowmap type + * @tparam lno_nnz_view_t_ The CRS matrix's (A) entries type + * @tparam scalar_nnz_view_t_ The CRS matrix's (A) values type + * @tparam BType The b vector type + * @tparam XType The x vector type + * @param handle KernelHandle instance + * @param rowmap The CRS matrix's (A) rowmap + * @param entries The CRS matrix's (A) entries + * @param values The CRS matrix's (A) values + * @param b The b vector + * @param x The x vector + */ +template +void sptrsv_solve(KernelHandle *handle, 
lno_row_view_t_ rowmap, + lno_nnz_view_t_ entries, scalar_nnz_view_t_ values, BType b, + XType x) { + using ExecutionSpace = typename KernelHandle::HandleExecSpace; + auto my_exec_space = ExecutionSpace(); + sptrsv_solve(my_exec_space, handle, rowmap, entries, values, b, x); +} + +#if defined(KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV) || defined(DOXY) +/** + * @brief Supernodal sptrsv solve phase of x for linear system Ax=b + * + * @tparam ExecutionSpace This kernels execution space + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam XType The x and b vector type + * @param space The execution space instance this kernel will run on + * @param handle KernelHandle instance + * @param x The x vector + * @param b The b vector + */ +template +void sptrsv_solve(ExecutionSpace &space, KernelHandle *handle, XType x, + XType b) { auto crsmat = handle->get_sptrsv_handle()->get_crsmat(); auto values = crsmat.values; auto graph = crsmat.graph; @@ -341,31 +505,79 @@ void sptrsv_solve(KernelHandle *handle, XType x, XType b) { if (handle->is_sptrsv_lower_tri()) { // apply forward pivoting - Kokkos::deep_copy(x, b); + Kokkos::deep_copy(space, x, b); // the fifth argument (i.e., first x) is not used - sptrsv_solve(handle, row_map, entries, values, x, x); + sptrsv_solve(space, handle, row_map, entries, values, x, x); } else { // the fifth argument (i.e., first x) is not used - sptrsv_solve(handle, row_map, entries, values, b, b); + sptrsv_solve(space, handle, row_map, entries, values, b, b); // apply backward pivoting - Kokkos::deep_copy(x, b); + Kokkos::deep_copy(space, x, b); } } -// --------------------------------------------------------------------- +/** + * @brief Supernodal sptrsv solve phase of x for linear system Ax=b + * + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam XType The x and b vector type + * @param handle KernelHandle instance + * @param x The x 
vector + * @param b The b vector + */ template -void sptrsv_solve(KernelHandle *handleL, KernelHandle *handleU, XType x, - XType b) { +void sptrsv_solve(KernelHandle *handle, XType x, XType b) { + using ExecutionSpace = typename KernelHandle::HandleExecSpace; + auto my_exec_space = ExecutionSpace(); + sptrsv_solve(my_exec_space, handle, x, b); +} + +/** + * @brief Supernodal sptrsv solve phase of x for linear system Ax=b + * + * @tparam ExecutionSpace This kernels execution space + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam XType The x and b vector type + * @param space The execution space instance this kernel will run on + * @param handleL KernelHandle instance for lower triangular matrix + * @param handleU KernelHandle instance for upper triangular matrix + * @param x The x vector + * @param b The b vector + */ +template +void sptrsv_solve(ExecutionSpace &space, KernelHandle *handleL, + KernelHandle *handleU, XType x, XType b) { // Lower-triangular solve - sptrsv_solve(handleL, x, b); + sptrsv_solve(space, handleL, x, b); // copy the solution to rhs - Kokkos::deep_copy(b, x); + Kokkos::deep_copy(space, b, x); // uper-triangular solve - sptrsv_solve(handleU, x, b); + sptrsv_solve(space, handleU, x, b); +} + +/** + * @brief Supernodal sptrsv solve phase of x for linear system Ax=b + * + * @tparam KernelHandle A specialization of + * KokkosKernels::Experimental::KokkosKernelsHandle + * @tparam XType The x and b vector type + * @param handleL KernelHandle instance for lower triangular matrix + * @param handleU KernelHandle instance for upper triangular matrix + * @param x The x vector + * @param b The b vector + */ +template +void sptrsv_solve(KernelHandle *handleL, KernelHandle *handleU, XType x, + XType b) { + using ExecutionSpace = typename KernelHandle::HandleExecSpace; + auto my_exec_space = ExecutionSpace(); + sptrsv_solve(my_exec_space, handleL, handleU, x, b); } #endif @@ -569,13 +781,21 @@ void 
sptrsv_solve_streams(const std::vector &execspace_v, if (handle_v[0]->get_sptrsv_handle()->get_algorithm() == KokkosSparse::Experimental::SPTRSVAlgorithm::SPTRSV_CUSPARSE) { +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE // NOTE: assume all streams use the same SPTRSV_CUSPARSE algo. KokkosSparse::Impl::sptrsvcuSPARSE_solve_streams< ExecutionSpace, const_handle_type, RowMap_Internal, Entries_Internal, Values_Internal, BType_Internal, XType_Internal>( execspace_v, handle_i_v, rowmap_i_v, entries_i_v, values_i_v, b_i_v, x_i_v, false); - +#else + KokkosSparse::Impl::SPTRSV_SOLVE< + ExecutionSpace, const_handle_type, RowMap_Internal, Entries_Internal, + Values_Internal, BType_Internal, + XType_Internal>::sptrsv_solve_streams(execspace_v, handle_i_v, + rowmap_i_v, entries_i_v, + values_i_v, b_i_v, x_i_v); +#endif } else { KokkosSparse::Impl::SPTRSV_SOLVE< ExecutionSpace, const_handle_type, RowMap_Internal, Entries_Internal, diff --git a/sparse/src/KokkosSparse_sptrsv_handle.hpp b/sparse/src/KokkosSparse_sptrsv_handle.hpp index 7c9027d24a..cf23bfdc1f 100644 --- a/sparse/src/KokkosSparse_sptrsv_handle.hpp +++ b/sparse/src/KokkosSparse_sptrsv_handle.hpp @@ -476,6 +476,22 @@ class SPTRSVHandle { this->set_if_algm_require_symb_lvlsched(); this->set_if_algm_require_symb_chain(); + // Check a few prerequisites before allowing users + // to run with the cusparse implementation of sptrsv. 
+ if (algm == SPTRSVAlgorithm::SPTRSV_CUSPARSE) { +#if !defined(KOKKOSKERNELS_ENABLE_TPL_CUSPARSE) + throw( + std::runtime_error("sptrsv handle: SPTRSV_CUSPARSE requested but " + "cuSPARSE TPL not enabled.")); +#else + if (!std::is_same_v) { + throw( + std::runtime_error("sptrsv handle: SPTRSV_CUSPARSE requested but " + "HandleExecSpace is not Kokkos::CUDA.")); + } +#endif + } + #ifdef KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV if (lower_tri) { // lower-triangular is stored in CSC diff --git a/sparse/src/KokkosSparse_trsv.hpp b/sparse/src/KokkosSparse_trsv.hpp index 1363542f1b..9b25811d10 100644 --- a/sparse/src/KokkosSparse_trsv.hpp +++ b/sparse/src/KokkosSparse_trsv.hpp @@ -68,11 +68,15 @@ void trsv(const char uplo[], const char trans[], const char diag[], typename XMV::non_const_value_type>::value, "KokkosBlas::trsv: The output x must be nonconst."); + static_assert(KokkosSparse::is_crs_matrix::value || + KokkosSparse::Experimental::is_bsr_matrix::value, + "KokkosBlas::trsv: A is not a CRS or BSR matrix."); + // The following three code lines have been moved up by Massimiliano Lupo // Pasini typedef typename BMV::size_type size_type; - const size_type numRows = static_cast(A.numRows()); - const size_type numCols = static_cast(A.numCols()); + const size_type numRows = static_cast(A.numPointRows()); + const size_type numCols = static_cast(A.numPointCols()); const size_type zero = static_cast(0); if (zero != numRows && uplo[0] != 'U' && uplo[0] != 'u' && uplo[0] != 'L' && @@ -117,13 +121,21 @@ void trsv(const char uplo[], const char trans[], const char diag[], KokkosKernels::Impl::throw_runtime_exception(os.str()); } - typedef KokkosSparse::CrsMatrix< + using AMatrix_Bsr_Internal = KokkosSparse::Experimental::BsrMatrix< typename AMatrix::const_value_type, typename AMatrix::const_ordinal_type, typename AMatrix::device_type, Kokkos::MemoryTraits, - typename AMatrix::const_size_type> - AMatrix_Internal; - - AMatrix_Internal A_i = A; + typename AMatrix::const_size_type>; + 
+ using AMatrix_Internal = std::conditional_t< + KokkosSparse::is_crs_matrix::value, + KokkosSparse::CrsMatrix, + typename AMatrix::const_size_type>, + AMatrix_Bsr_Internal>; + + AMatrix_Internal A_i(A); typedef Kokkos::View< typename BMV::const_value_type**, typename BMV::array_layout, diff --git a/sparse/tpls/KokkosSparse_spadd_numeric_tpl_spec_decl.hpp b/sparse/tpls/KokkosSparse_spadd_numeric_tpl_spec_decl.hpp new file mode 100644 index 0000000000..0952654bdf --- /dev/null +++ b/sparse/tpls/KokkosSparse_spadd_numeric_tpl_spec_decl.hpp @@ -0,0 +1,282 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_HPP_ +#define KOKKOSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_HPP_ + +namespace KokkosSparse { +namespace Impl { + +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + +#define KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_CUSPARSE( \ + TOKEN, KOKKOS_SCALAR_TYPE, TPL_SCALAR_TYPE, ORDINAL_TYPE, OFFSET_TYPE, \ + LAYOUT_TYPE, EXEC_SPACE_TYPE, MEM_SPACE_TYPE, ETI_SPEC_AVAIL) \ + template <> \ + struct SPADD_NUMERIC< \ + EXEC_SPACE_TYPE, \ + KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const KOKKOS_SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, ETI_SPEC_AVAIL> { \ + using kernelhandle_t = KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const KOKKOS_SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>; \ + using rowmap_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using non_const_rowmap_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using colidx_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using non_const_colidx_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using scalar_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using non_const_scalar_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + static void spadd_numeric( \ + const EXEC_SPACE_TYPE &exec, kernelhandle_t *handle, ORDINAL_TYPE m, \ + ORDINAL_TYPE n, const KOKKOS_SCALAR_TYPE alpha, rowmap_view_t rowmapA, \ + 
colidx_view_t colidxA, scalar_view_t valuesA, \ + const KOKKOS_SCALAR_TYPE beta, rowmap_view_t rowmapB, \ + colidx_view_t colidxB, scalar_view_t valuesB, rowmap_view_t rowmapC, \ + non_const_colidx_view_t colidxC, non_const_scalar_view_t valuesC) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosSparse::spadd_numeric[TPL_CUSPARSE," + \ + Kokkos::ArithTraits::name() + "]"); \ + \ + auto addHandle = handle->get_spadd_handle(); \ + auto &cuspData = addHandle->cusparseData; \ + auto &cuspHandle = \ + KokkosKernels::Impl::CusparseSingleton::singleton().cusparseHandle; \ + cusparsePointerMode_t oldPtrMode; \ + \ + KOKKOS_CUSPARSE_SAFE_CALL( \ + cusparseSetStream(cuspHandle, exec.cuda_stream())); \ + KOKKOS_CUSPARSE_SAFE_CALL( \ + cusparseGetPointerMode(cuspHandle, &oldPtrMode)); \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseSetPointerMode( \ + cuspHandle, CUSPARSE_POINTER_MODE_HOST)); /* alpha, beta on host*/ \ + OFFSET_TYPE nnzA = colidxA.extent(0); \ + OFFSET_TYPE nnzB = colidxB.extent(0); \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparse##TOKEN##csrgeam2( \ + cuspHandle, m, n, reinterpret_cast(&alpha), \ + cuspData.descrA, nnzA, \ + reinterpret_cast(valuesA.data()), \ + rowmapA.data(), colidxA.data(), \ + reinterpret_cast(&beta), cuspData.descrB, \ + nnzB, reinterpret_cast(valuesB.data()), \ + rowmapB.data(), colidxB.data(), cuspData.descrC, \ + reinterpret_cast(valuesC.data()), \ + const_cast(rowmapC.data()), colidxC.data(), \ + cuspData.workspace)); \ + KOKKOS_CUSPARSE_SAFE_CALL( \ + cusparseSetPointerMode(cuspHandle, oldPtrMode)); \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseSetStream(cuspHandle, NULL)); \ + \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_CUSPARSE_EXT(ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_CUSPARSE( \ + S, float, float, int, int, Kokkos::LayoutLeft, Kokkos::Cuda, \ + Kokkos::CudaSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_CUSPARSE( \ + D, double, double, int, int, 
Kokkos::LayoutLeft, Kokkos::Cuda, \ + Kokkos::CudaSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_CUSPARSE( \ + C, Kokkos::complex, cuComplex, int, int, Kokkos::LayoutLeft, \ + Kokkos::Cuda, Kokkos::CudaSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_CUSPARSE( \ + Z, Kokkos::complex, cuDoubleComplex, int, int, \ + Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, ETI_SPEC_AVAIL) + +KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_CUSPARSE_EXT(true) +KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_CUSPARSE_EXT(false) +#endif + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + +#define KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_ROCSPARSE( \ + TOKEN, KOKKOS_SCALAR_TYPE, TPL_SCALAR_TYPE, ORDINAL_TYPE, OFFSET_TYPE, \ + LAYOUT_TYPE, EXEC_SPACE_TYPE, MEM_SPACE_TYPE, ETI_SPEC_AVAIL) \ + template <> \ + struct SPADD_NUMERIC< \ + EXEC_SPACE_TYPE, \ + KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const KOKKOS_SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, ETI_SPEC_AVAIL> { \ + using kernelhandle_t = KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const KOKKOS_SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>; \ + using rowmap_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using non_const_rowmap_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using colidx_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using non_const_colidx_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using 
scalar_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + using non_const_scalar_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits>; \ + static void spadd_numeric( \ + const EXEC_SPACE_TYPE &exec, kernelhandle_t *handle, ORDINAL_TYPE m, \ + ORDINAL_TYPE n, const KOKKOS_SCALAR_TYPE alpha, rowmap_view_t rowmapA, \ + colidx_view_t colidxA, scalar_view_t valuesA, \ + const KOKKOS_SCALAR_TYPE beta, rowmap_view_t rowmapB, \ + colidx_view_t colidxB, scalar_view_t valuesB, rowmap_view_t rowmapC, \ + non_const_colidx_view_t colidxC, non_const_scalar_view_t valuesC) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosSparse::spadd_numeric[TPL_ROCSPARSE," + \ + Kokkos::ArithTraits::name() + "]"); \ + \ + auto addHandle = handle->get_spadd_handle(); \ + auto &rocData = addHandle->rocsparseData; \ + auto &rocspHandle = KokkosKernels::Impl::RocsparseSingleton::singleton() \ + .rocsparseHandle; \ + rocsparse_pointer_mode oldPtrMode; \ + \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_set_stream(rocspHandle, exec.hip_stream())); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_get_pointer_mode(rocspHandle, &oldPtrMode)); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_set_pointer_mode( \ + rocspHandle, rocsparse_pointer_mode_host)); /* alpha, beta on host*/ \ + OFFSET_TYPE nnzA = colidxA.extent(0); \ + OFFSET_TYPE nnzB = colidxB.extent(0); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_##TOKEN##csrgeam( \ + rocspHandle, m, n, \ + reinterpret_cast(&alpha), rocData.descrA, \ + nnzA, reinterpret_cast(valuesA.data()), \ + rowmapA.data(), colidxA.data(), \ + reinterpret_cast(&beta), rocData.descrB, \ + nnzB, reinterpret_cast(valuesB.data()), \ + rowmapB.data(), colidxB.data(), rocData.descrC, \ + reinterpret_cast(valuesC.data()), \ + const_cast(rowmapC.data()), colidxC.data())); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_set_pointer_mode(rocspHandle, oldPtrMode)); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_set_stream(rocspHandle, NULL)); \ + \ + 
Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_ROCSPARSE_EXT(ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_ROCSPARSE( \ + s, float, float, int, int, Kokkos::LayoutLeft, Kokkos::HIP, \ + Kokkos::HIPSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_ROCSPARSE( \ + d, double, double, int, int, Kokkos::LayoutLeft, Kokkos::HIP, \ + Kokkos::HIPSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_ROCSPARSE( \ + c, Kokkos::complex, rocsparse_float_complex, int, int, \ + Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_ROCSPARSE( \ + z, Kokkos::complex, rocsparse_double_complex, int, int, \ + Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, ETI_SPEC_AVAIL) + +KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_ROCSPARSE_EXT(true) +KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_DECL_ROCSPARSE_EXT(false) +#endif + +} // namespace Impl +} // namespace KokkosSparse + +#endif diff --git a/sparse/tpls/KokkosSparse_spadd_symbolic_tpl_spec_decl.hpp b/sparse/tpls/KokkosSparse_spadd_symbolic_tpl_spec_decl.hpp new file mode 100644 index 0000000000..fe6b51207f --- /dev/null +++ b/sparse/tpls/KokkosSparse_spadd_symbolic_tpl_spec_decl.hpp @@ -0,0 +1,238 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef KOKKOSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_HPP_ +#define KOKKOSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_HPP_ + +namespace KokkosSparse { +namespace Impl { + +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE + +#define KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_CUSPARSE( \ + TOKEN, KOKKOS_SCALAR_TYPE, TPL_SCALAR_TYPE, ORDINAL_TYPE, OFFSET_TYPE, \ + LAYOUT_TYPE, EXEC_SPACE_TYPE, MEM_SPACE_TYPE, ETI_SPEC_AVAIL) \ + template <> \ + struct SPADD_SYMBOLIC< \ + EXEC_SPACE_TYPE, \ + KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const KOKKOS_SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + using kernelhandle_t = KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const KOKKOS_SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>; \ + using rowmap_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits >; \ + using non_const_rowmap_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits >; \ + using colidx_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits >; \ + static void spadd_symbolic(const EXEC_SPACE_TYPE& exec, \ + kernelhandle_t* handle, const ORDINAL_TYPE m, \ + const ORDINAL_TYPE n, rowmap_view_t rowmapA, \ + colidx_view_t colidxA, rowmap_view_t rowmapB, \ + colidx_view_t colidxB, \ + non_const_rowmap_view_t rowmapC) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosSparse::spadd_symbolic[TPL_CUSPARSE," + \ + Kokkos::ArithTraits::name() + "]"); \ + \ + auto addHandle = handle->get_spadd_handle(); \ + auto& cuspData = addHandle->cusparseData; \ + auto& cuspHandle = \ + 
KokkosKernels::Impl::CusparseSingleton::singleton().cusparseHandle; \ + \ + /* Not easy to init 'one' for cuda complex, so we don't init it. Anyway, \ + * the uninit'ed var won't affect C's pattern. \ + */ \ + TPL_SCALAR_TYPE one; \ + size_t nbytes; \ + OFFSET_TYPE nnzA = colidxA.extent(0); \ + OFFSET_TYPE nnzB = colidxB.extent(0); \ + OFFSET_TYPE nnzC = 0; \ + \ + KOKKOS_CUSPARSE_SAFE_CALL( \ + cusparseSetStream(cuspHandle, exec.cuda_stream())); \ + \ + /* https://docs.nvidia.com/cuda/cusparse/index.html#cusparsecreatematdescr \ + It sets the fields MatrixType and IndexBase to the default values \ + CUSPARSE_MATRIX_TYPE_GENERAL and CUSPARSE_INDEX_BASE_ZERO, \ + respectively, while leaving other fields uninitialized. */ \ + \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&cuspData.descrA)); \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&cuspData.descrB)); \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&cuspData.descrC)); \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparse##TOKEN##csrgeam2_bufferSizeExt( \ + cuspHandle, m, n, &one, cuspData.descrA, nnzA, NULL, rowmapA.data(), \ + colidxA.data(), &one, cuspData.descrB, nnzB, NULL, rowmapB.data(), \ + colidxB.data(), cuspData.descrC, NULL, rowmapC.data(), NULL, \ + &nbytes)); \ + cuspData.nbytes = nbytes; \ + cuspData.workspace = Kokkos::kokkos_malloc(nbytes); \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseXcsrgeam2Nnz( \ + cuspHandle, m, n, cuspData.descrA, nnzA, rowmapA.data(), \ + colidxA.data(), cuspData.descrB, nnzB, rowmapB.data(), \ + colidxB.data(), cuspData.descrC, rowmapC.data(), &nnzC, \ + cuspData.workspace)); \ + addHandle->set_c_nnz(nnzC); \ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseSetStream(cuspHandle, NULL)); \ + \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_CUSPARSE_EXT(ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_CUSPARSE( \ + S, float, float, int, int, Kokkos::LayoutLeft, Kokkos::Cuda, \ + Kokkos::CudaSpace, ETI_SPEC_AVAIL) \ + 
KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_CUSPARSE( \ + D, double, double, int, int, Kokkos::LayoutLeft, Kokkos::Cuda, \ + Kokkos::CudaSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_CUSPARSE( \ + C, Kokkos::complex, cuComplex, int, int, Kokkos::LayoutLeft, \ + Kokkos::Cuda, Kokkos::CudaSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_CUSPARSE( \ + Z, Kokkos::complex, cuDoubleComplex, int, int, \ + Kokkos::LayoutLeft, Kokkos::Cuda, Kokkos::CudaSpace, ETI_SPEC_AVAIL) + +KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_CUSPARSE_EXT(true) +KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_CUSPARSE_EXT(false) +#endif + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE + +#define KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_ROCSPARSE( \ + KOKKOS_SCALAR_TYPE, ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, ETI_SPEC_AVAIL) \ + template <> \ + struct SPADD_SYMBOLIC< \ + EXEC_SPACE_TYPE, \ + KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const KOKKOS_SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + true, ETI_SPEC_AVAIL> { \ + using kernelhandle_t = KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const KOKKOS_SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>; \ + using rowmap_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits >; \ + using non_const_rowmap_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits >; \ + using colidx_view_t = \ + Kokkos::View, \ + Kokkos::MemoryTraits >; \ + static void spadd_symbolic(const EXEC_SPACE_TYPE& exec, \ + kernelhandle_t* handle, const ORDINAL_TYPE m, \ + const ORDINAL_TYPE n, rowmap_view_t rowmapA, \ + colidx_view_t colidxA, rowmap_view_t 
rowmapB, \ + colidx_view_t colidxB, \ + non_const_rowmap_view_t rowmapC) { \ + Kokkos::Profiling::pushRegion( \ + "KokkosSparse::spadd_symbolic[TPL_ROCSPARSE," + \ + Kokkos::ArithTraits::name() + "]"); \ + \ + auto addHandle = handle->get_spadd_handle(); \ + auto& rocData = addHandle->rocsparseData; \ + auto& rocspHandle = KokkosKernels::Impl::RocsparseSingleton::singleton() \ + .rocsparseHandle; \ + OFFSET_TYPE nnzA = colidxA.extent(0); \ + OFFSET_TYPE nnzB = colidxB.extent(0); \ + OFFSET_TYPE nnzC = 0; \ + \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_set_stream(rocspHandle, exec.hip_stream())); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_create_mat_descr(&rocData.descrA)); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_create_mat_descr(&rocData.descrB)); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_create_mat_descr(&rocData.descrC)); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_csrgeam_nnz( \ + rocspHandle, m, n, rocData.descrA, nnzA, rowmapA.data(), \ + colidxA.data(), rocData.descrB, nnzB, rowmapB.data(), \ + colidxB.data(), rocData.descrC, rowmapC.data(), &nnzC)); \ + addHandle->set_c_nnz(nnzC); \ + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( \ + rocsparse_set_stream(rocspHandle, NULL)); \ + Kokkos::Profiling::popRegion(); \ + } \ + }; + +#define KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_ROCSPARSE_EXT( \ + ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_ROCSPARSE( \ + float, rocsparse_int, rocsparse_int, Kokkos::LayoutLeft, Kokkos::HIP, \ + Kokkos::HIPSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_ROCSPARSE( \ + double, rocsparse_int, rocsparse_int, Kokkos::LayoutLeft, Kokkos::HIP, \ + Kokkos::HIPSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_ROCSPARSE( \ + Kokkos::complex, rocsparse_int, rocsparse_int, \ + Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, ETI_SPEC_AVAIL) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_ROCSPARSE( \ + Kokkos::complex, rocsparse_int, rocsparse_int, \ + 
Kokkos::LayoutLeft, Kokkos::HIP, Kokkos::HIPSpace, ETI_SPEC_AVAIL) + +KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_ROCSPARSE_EXT(true) +KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_DECL_ROCSPARSE_EXT(false) +#endif + +} // namespace Impl +} // namespace KokkosSparse + +#endif diff --git a/sparse/tpls/KokkosSparse_spadd_tpl_spec_avail.hpp b/sparse/tpls/KokkosSparse_spadd_tpl_spec_avail.hpp index b654c4331c..6d4db8731f 100644 --- a/sparse/tpls/KokkosSparse_spadd_tpl_spec_avail.hpp +++ b/sparse/tpls/KokkosSparse_spadd_tpl_spec_avail.hpp @@ -21,20 +21,125 @@ namespace KokkosSparse { namespace Impl { // Specialization struct which defines whether a specialization exists // -template +template struct spadd_symbolic_tpl_spec_avail { enum : bool { value = false }; }; -template +template struct spadd_numeric_tpl_spec_avail { enum : bool { value = false }; }; +#define KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_AVAIL( \ + SCALAR_TYPE, ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, EXEC_SPACE_TYPE, \ + MEM_SPACE_TYPE) \ + template <> \ + struct spadd_symbolic_tpl_spec_avail< \ + EXEC_SPACE_TYPE, \ + KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ + }; + +#define KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_AVAIL( \ + SCALAR_TYPE, ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, EXEC_SPACE_TYPE, \ + MEM_SPACE_TYPE) \ + template <> \ + struct spadd_numeric_tpl_spec_avail< \ + EXEC_SPACE_TYPE, \ + KokkosKernels::Experimental::KokkosKernelsHandle< \ + const OFFSET_TYPE, const ORDINAL_TYPE, const SCALAR_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE, MEM_SPACE_TYPE>, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + 
Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits >, \ + Kokkos::View, \ + Kokkos::MemoryTraits > > { \ + enum : bool { value = true }; \ + }; + +#define KOKKOSSPARSE_SPADD_TPL_SPEC_AVAIL( \ + ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_AVAIL(float, ORDINAL_TYPE, OFFSET_TYPE, \ + LAYOUT_TYPE, EXEC_SPACE_TYPE, \ + MEM_SPACE_TYPE) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_AVAIL(double, ORDINAL_TYPE, \ + OFFSET_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_AVAIL( \ + Kokkos::complex, ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ + KOKKOSSPARSE_SPADD_SYMBOLIC_TPL_SPEC_AVAIL( \ + Kokkos::complex, ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_AVAIL(float, ORDINAL_TYPE, OFFSET_TYPE, \ + LAYOUT_TYPE, EXEC_SPACE_TYPE, \ + MEM_SPACE_TYPE) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_AVAIL(double, ORDINAL_TYPE, OFFSET_TYPE, \ + LAYOUT_TYPE, EXEC_SPACE_TYPE, \ + MEM_SPACE_TYPE) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_AVAIL( \ + Kokkos::complex, ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE) \ + KOKKOSSPARSE_SPADD_NUMERIC_TPL_SPEC_AVAIL( \ + Kokkos::complex, ORDINAL_TYPE, OFFSET_TYPE, LAYOUT_TYPE, \ + EXEC_SPACE_TYPE, MEM_SPACE_TYPE) + +#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE +KOKKOSSPARSE_SPADD_TPL_SPEC_AVAIL(int, int, Kokkos::LayoutLeft, Kokkos::Cuda, + Kokkos::CudaSpace) +#endif + +#ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE +KOKKOSSPARSE_SPADD_TPL_SPEC_AVAIL(rocsparse_int, rocsparse_int, + Kokkos::LayoutLeft, Kokkos::HIP, + Kokkos::HIPSpace) +#endif + } // namespace Impl } // namespace 
KokkosSparse diff --git a/sparse/tpls/KokkosSparse_spgemm_numeric_tpl_spec_avail.hpp b/sparse/tpls/KokkosSparse_spgemm_numeric_tpl_spec_avail.hpp index e144b53162..517e104988 100644 --- a/sparse/tpls/KokkosSparse_spgemm_numeric_tpl_spec_avail.hpp +++ b/sparse/tpls/KokkosSparse_spgemm_numeric_tpl_spec_avail.hpp @@ -82,10 +82,12 @@ struct spgemm_numeric_tpl_spec_avail { SPGEMM_NUMERIC_AVAIL_CUSPARSE(SCALAR, Kokkos::CudaSpace) \ SPGEMM_NUMERIC_AVAIL_CUSPARSE(SCALAR, Kokkos::CudaUVMSpace) +#if (CUDA_VERSION < 11000) || (CUDA_VERSION >= 11040) SPGEMM_NUMERIC_AVAIL_CUSPARSE_S(float) SPGEMM_NUMERIC_AVAIL_CUSPARSE_S(double) SPGEMM_NUMERIC_AVAIL_CUSPARSE_S(Kokkos::complex) SPGEMM_NUMERIC_AVAIL_CUSPARSE_S(Kokkos::complex) +#endif #endif diff --git a/sparse/tpls/KokkosSparse_spgemm_symbolic_tpl_spec_avail.hpp b/sparse/tpls/KokkosSparse_spgemm_symbolic_tpl_spec_avail.hpp index b8c545ffe2..41e8802214 100644 --- a/sparse/tpls/KokkosSparse_spgemm_symbolic_tpl_spec_avail.hpp +++ b/sparse/tpls/KokkosSparse_spgemm_symbolic_tpl_spec_avail.hpp @@ -67,11 +67,13 @@ struct spgemm_symbolic_tpl_spec_avail { SPGEMM_SYMBOLIC_AVAIL_CUSPARSE(SCALAR, Kokkos::CudaSpace) \ SPGEMM_SYMBOLIC_AVAIL_CUSPARSE(SCALAR, Kokkos::CudaUVMSpace) +#if (CUDA_VERSION < 11000) || (CUDA_VERSION >= 11040) SPGEMM_SYMBOLIC_AVAIL_CUSPARSE_S(float) SPGEMM_SYMBOLIC_AVAIL_CUSPARSE_S(double) SPGEMM_SYMBOLIC_AVAIL_CUSPARSE_S(Kokkos::complex) SPGEMM_SYMBOLIC_AVAIL_CUSPARSE_S(Kokkos::complex) #endif +#endif #ifdef KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE #define SPGEMM_SYMBOLIC_AVAIL_ROCSPARSE(SCALAR) \ diff --git a/sparse/tpls/KokkosSparse_spmv_bsrmatrix_tpl_spec_avail.hpp b/sparse/tpls/KokkosSparse_spmv_bsrmatrix_tpl_spec_avail.hpp index 07bb0a0f0a..16bf1abecf 100644 --- a/sparse/tpls/KokkosSparse_spmv_bsrmatrix_tpl_spec_avail.hpp +++ b/sparse/tpls/KokkosSparse_spmv_bsrmatrix_tpl_spec_avail.hpp @@ -22,10 +22,10 @@ #endif namespace KokkosSparse { -namespace Experimental { namespace Impl { // Specialization struct which defines 
whether a specialization exists -template +template struct spmv_bsrmatrix_tpl_spec_avail { enum : bool { value = false }; }; @@ -41,6 +41,8 @@ struct spmv_bsrmatrix_tpl_spec_avail { template <> \ struct spmv_bsrmatrix_tpl_spec_avail< \ Kokkos::Cuda, \ + KokkosSparse::Impl::SPMVHandleImpl, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR, const ORDINAL, Kokkos::Device, \ Kokkos::MemoryTraits, const OFFSET>, \ @@ -127,22 +129,24 @@ KOKKOSSPARSE_SPMV_BSRMATRIX_TPL_SPEC_AVAIL_CUSPARSE(Kokkos::complex, #endif // KOKKOSKERNELS_ENABLE_TPL_CUSPARSE #ifdef KOKKOSKERNELS_ENABLE_TPL_MKL -#define KOKKOSSPARSE_SPMV_BSRMATRIX_TPL_SPEC_AVAIL_MKL(SCALAR, EXECSPACE) \ - template <> \ - struct spmv_bsrmatrix_tpl_spec_avail< \ - EXECSPACE, \ - ::KokkosSparse::Experimental::BsrMatrix< \ - const SCALAR, const MKL_INT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits, const MKL_INT>, \ - Kokkos::View< \ - const SCALAR*, Kokkos::LayoutLeft, \ - Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>> { \ - enum : bool { value = true }; \ +#define KOKKOSSPARSE_SPMV_BSRMATRIX_TPL_SPEC_AVAIL_MKL(SCALAR, EXECSPACE) \ + template <> \ + struct spmv_bsrmatrix_tpl_spec_avail< \ + EXECSPACE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + ::KokkosSparse::Experimental::BsrMatrix< \ + const SCALAR, const MKL_INT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits, const MKL_INT>, \ + Kokkos::View< \ + const SCALAR*, Kokkos::LayoutLeft, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>> { \ + enum : bool { value = true }; \ }; #ifdef KOKKOS_ENABLE_SERIAL @@ -166,7 +170,8 @@ KOKKOSSPARSE_SPMV_BSRMATRIX_TPL_SPEC_AVAIL_MKL(Kokkos::complex, #endif // Specialization struct which defines whether a specialization exists -template > struct spmv_mv_bsrmatrix_tpl_spec_avail { @@ -184,6 +189,8 @@ struct spmv_mv_bsrmatrix_tpl_spec_avail { template <> \ struct spmv_mv_bsrmatrix_tpl_spec_avail< \ Kokkos::Cuda, \ + 
KokkosSparse::Impl::SPMVHandleImpl, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR, const ORDINAL, Kokkos::Device, \ Kokkos::MemoryTraits, const OFFSET>, \ @@ -231,23 +238,25 @@ KOKKOSSPARSE_SPMV_MV_BSRMATRIX_TPL_SPEC_AVAIL_CUSPARSE(Kokkos::complex, #endif // KOKKOSKERNELS_ENABLE_TPL_CUSPARSE #ifdef KOKKOSKERNELS_ENABLE_TPL_MKL -#define KOKKOSSPARSE_SPMV_MV_BSRMATRIX_TPL_SPEC_AVAIL_MKL(SCALAR, EXECSPACE) \ - template <> \ - struct spmv_mv_bsrmatrix_tpl_spec_avail< \ - EXECSPACE, \ - ::KokkosSparse::Experimental::BsrMatrix< \ - const SCALAR, const int, \ - Kokkos::Device, \ - Kokkos::MemoryTraits, const int>, \ - Kokkos::View< \ - const SCALAR*, Kokkos::LayoutLeft, \ - Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>, \ - true> { \ - enum : bool { value = true }; \ +#define KOKKOSSPARSE_SPMV_MV_BSRMATRIX_TPL_SPEC_AVAIL_MKL(SCALAR, EXECSPACE) \ + template <> \ + struct spmv_mv_bsrmatrix_tpl_spec_avail< \ + EXECSPACE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + ::KokkosSparse::Experimental::BsrMatrix< \ + const SCALAR, const int, \ + Kokkos::Device, \ + Kokkos::MemoryTraits, const int>, \ + Kokkos::View< \ + const SCALAR*, Kokkos::LayoutLeft, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true> { \ + enum : bool { value = true }; \ }; #ifdef KOKKOS_ENABLE_SERIAL @@ -279,6 +288,8 @@ KOKKOSSPARSE_SPMV_MV_BSRMATRIX_TPL_SPEC_AVAIL_MKL(Kokkos::complex, template <> \ struct spmv_bsrmatrix_tpl_spec_avail< \ Kokkos::HIP, \ + KokkosSparse::Impl::SPMVHandleImpl, \ ::KokkosSparse::Experimental::BsrMatrix< \ const SCALAR, const ORDINAL, Kokkos::Device, \ Kokkos::MemoryTraits, const OFFSET>, \ @@ -336,7 +347,6 @@ KOKKOSSPARSE_SPMV_BSRMATRIX_TPL_SPEC_AVAIL_ROCSPARSE(Kokkos::complex, #endif // defined(KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE) } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse #endif // KOKKOSPARSE_SPMV_BSRMATRIX_TPL_SPEC_AVAIL_HPP_ diff --git 
a/sparse/tpls/KokkosSparse_spmv_bsrmatrix_tpl_spec_decl.hpp b/sparse/tpls/KokkosSparse_spmv_bsrmatrix_tpl_spec_decl.hpp index 75752190e7..9c844ff910 100644 --- a/sparse/tpls/KokkosSparse_spmv_bsrmatrix_tpl_spec_decl.hpp +++ b/sparse/tpls/KokkosSparse_spmv_bsrmatrix_tpl_spec_decl.hpp @@ -18,228 +18,222 @@ #define KOKKOSSPARSE_SPMV_BSRMATRIX_TPL_SPEC_DECL_HPP #include "KokkosKernels_AlwaysFalse.hpp" -#include "KokkosKernels_Controls.hpp" #include "KokkosSparse_Utils_mkl.hpp" #include "KokkosSparse_Utils_cusparse.hpp" +#include "KokkosKernels_tpl_handles_decl.hpp" -#ifdef KOKKOSKERNELS_ENABLE_TPL_MKL +#if defined(KOKKOSKERNELS_ENABLE_TPL_MKL) && (__INTEL_MKL__ > 2017) #include namespace KokkosSparse { -namespace Experimental { namespace Impl { -#if (__INTEL_MKL__ > 2017) // MKL 2018 and above: use new interface: sparse_matrix_t and mkl_sparse_?_mv() -using KokkosSparse::Impl::mode_kk_to_mkl; - -inline matrix_descr getDescription() { - matrix_descr A_descr; - A_descr.type = SPARSE_MATRIX_TYPE_GENERAL; - A_descr.mode = SPARSE_FILL_MODE_FULL; - A_descr.diag = SPARSE_DIAG_NON_UNIT; - return A_descr; -} - -inline void spmv_block_impl_mkl(sparse_operation_t op, float alpha, float beta, - MKL_INT m, MKL_INT n, MKL_INT b, - const MKL_INT* Arowptrs, - const MKL_INT* Aentries, const float* Avalues, - const float* x, float* y) { - sparse_matrix_t A_mkl; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_create_bsr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, n, b, - const_cast(Arowptrs), const_cast(Arowptrs + 1), - const_cast(Aentries), const_cast(Avalues))); - - matrix_descr A_descr = getDescription(); - KOKKOSKERNELS_MKL_SAFE_CALL( - mkl_sparse_s_mv(op, alpha, A_mkl, A_descr, x, beta, y)); -} - -inline void spmv_block_impl_mkl(sparse_operation_t op, double alpha, - double beta, MKL_INT m, MKL_INT n, MKL_INT b, - const MKL_INT* Arowptrs, - const MKL_INT* Aentries, const double* Avalues, - const double* x, double* y) { - sparse_matrix_t A_mkl; - 
KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_create_bsr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, n, b, - const_cast(Arowptrs), const_cast(Arowptrs + 1), - const_cast(Aentries), const_cast(Avalues))); - - matrix_descr A_descr = getDescription(); - KOKKOSKERNELS_MKL_SAFE_CALL( - mkl_sparse_d_mv(op, alpha, A_mkl, A_descr, x, beta, y)); -} - -inline void spmv_block_impl_mkl(sparse_operation_t op, - Kokkos::complex alpha, - Kokkos::complex beta, MKL_INT m, - MKL_INT n, MKL_INT b, const MKL_INT* Arowptrs, - const MKL_INT* Aentries, - const Kokkos::complex* Avalues, - const Kokkos::complex* x, - Kokkos::complex* y) { - sparse_matrix_t A_mkl; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_create_bsr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, n, b, - const_cast(Arowptrs), const_cast(Arowptrs + 1), - const_cast(Aentries), (MKL_Complex8*)Avalues)); - - MKL_Complex8 alpha_mkl{alpha.real(), alpha.imag()}; - MKL_Complex8 beta_mkl{beta.real(), beta.imag()}; - matrix_descr A_descr = getDescription(); - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_mv( - op, alpha_mkl, A_mkl, A_descr, reinterpret_cast(x), - beta_mkl, reinterpret_cast(y))); -} - -inline void spmv_block_impl_mkl(sparse_operation_t op, - Kokkos::complex alpha, - Kokkos::complex beta, MKL_INT m, - MKL_INT n, MKL_INT b, const MKL_INT* Arowptrs, - const MKL_INT* Aentries, - const Kokkos::complex* Avalues, - const Kokkos::complex* x, - Kokkos::complex* y) { - sparse_matrix_t A_mkl; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_create_bsr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, n, b, - const_cast(Arowptrs), const_cast(Arowptrs + 1), - const_cast(Aentries), (MKL_Complex16*)Avalues)); - - matrix_descr A_descr = getDescription(); - MKL_Complex16 alpha_mkl{alpha.real(), alpha.imag()}; - MKL_Complex16 beta_mkl{beta.real(), beta.imag()}; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_mv( - op, alpha_mkl, A_mkl, A_descr, reinterpret_cast(x), - beta_mkl, reinterpret_cast(y))); -} - 
-inline void spm_mv_block_impl_mkl(sparse_operation_t op, float alpha, - float beta, MKL_INT m, MKL_INT n, MKL_INT b, - const MKL_INT* Arowptrs, - const MKL_INT* Aentries, const float* Avalues, - const float* x, MKL_INT colx, MKL_INT ldx, - float* y, MKL_INT ldy) { - sparse_matrix_t A_mkl; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_create_bsr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, n, b, - const_cast(Arowptrs), const_cast(Arowptrs + 1), - const_cast(Aentries), const_cast(Avalues))); - - matrix_descr A_descr = getDescription(); - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_mm(op, alpha, A_mkl, A_descr, - SPARSE_LAYOUT_ROW_MAJOR, x, colx, - ldx, beta, y, ldy)); -} - -inline void spm_mv_block_impl_mkl(sparse_operation_t op, double alpha, - double beta, MKL_INT m, MKL_INT n, MKL_INT b, - const MKL_INT* Arowptrs, - const MKL_INT* Aentries, - const double* Avalues, const double* x, - MKL_INT colx, MKL_INT ldx, double* y, - MKL_INT ldy) { - sparse_matrix_t A_mkl; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_create_bsr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, n, b, - const_cast(Arowptrs), const_cast(Arowptrs + 1), - const_cast(Aentries), const_cast(Avalues))); - - matrix_descr A_descr = getDescription(); - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_mm(op, alpha, A_mkl, A_descr, - SPARSE_LAYOUT_ROW_MAJOR, x, colx, - ldx, beta, y, ldy)); -} - -inline void spm_mv_block_impl_mkl( - sparse_operation_t op, Kokkos::complex alpha, - Kokkos::complex beta, MKL_INT m, MKL_INT n, MKL_INT b, - const MKL_INT* Arowptrs, const MKL_INT* Aentries, - const Kokkos::complex* Avalues, const Kokkos::complex* x, - MKL_INT colx, MKL_INT ldx, Kokkos::complex* y, MKL_INT ldy) { - sparse_matrix_t A_mkl; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_create_bsr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, n, b, - const_cast(Arowptrs), const_cast(Arowptrs + 1), - const_cast(Aentries), (MKL_Complex8*)Avalues)); - - MKL_Complex8 alpha_mkl{alpha.real(), 
alpha.imag()}; - MKL_Complex8 beta_mkl{beta.real(), beta.imag()}; - matrix_descr A_descr = getDescription(); - KOKKOSKERNELS_MKL_SAFE_CALL( - mkl_sparse_c_mm(op, alpha_mkl, A_mkl, A_descr, SPARSE_LAYOUT_ROW_MAJOR, - reinterpret_cast(x), colx, ldx, - beta_mkl, reinterpret_cast(y), ldy)); +// Note: Scalar here is the Kokkos type, not the MKL type +template +inline void spmv_bsr_mkl(Handle* handle, sparse_operation_t op, Scalar alpha, + Scalar beta, MKL_INT m, MKL_INT n, MKL_INT b, + const MKL_INT* Arowptrs, const MKL_INT* Aentries, + const Scalar* Avalues, const Scalar* x, Scalar* y) { + using MKLScalar = + typename KokkosSparse::Impl::KokkosToMKLScalar::type; + using ExecSpace = typename Handle::ExecutionSpaceType; + using Subhandle = KokkosSparse::Impl::MKL_SpMV_Data; + Subhandle* subhandle; + const MKLScalar* x_mkl = reinterpret_cast(x); + MKLScalar* y_mkl = reinterpret_cast(y); + if (handle->is_set_up) { + subhandle = dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for MKL BSR"); + } else { + // Use the default execution space instance, as classic MKL does not use + // a specific instance. 
+ subhandle = new Subhandle(ExecSpace()); + handle->tpl = subhandle; + subhandle->descr.type = SPARSE_MATRIX_TYPE_GENERAL; + subhandle->descr.mode = SPARSE_FILL_MODE_FULL; + subhandle->descr.diag = SPARSE_DIAG_NON_UNIT; + // Note: the create_csr routine requires non-const values even though + // they're not actually modified + MKLScalar* Avalues_mkl = + reinterpret_cast(const_cast(Avalues)); + if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_create_bsr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, + n, b, const_cast(Arowptrs), + const_cast(Arowptrs + 1), const_cast(Aentries), + Avalues_mkl)); + } else if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_create_bsr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, + n, b, const_cast(Arowptrs), + const_cast(Arowptrs + 1), const_cast(Aentries), + Avalues_mkl)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_create_bsr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, + n, b, const_cast(Arowptrs), + const_cast(Arowptrs + 1), const_cast(Aentries), + Avalues_mkl)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_create_bsr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, + n, b, const_cast(Arowptrs), + const_cast(Arowptrs + 1), const_cast(Aentries), + Avalues_mkl)); + } + handle->is_set_up = true; + } + MKLScalar alpha_mkl = KokkosSparse::Impl::KokkosToMKLScalar(alpha); + MKLScalar beta_mkl = KokkosSparse::Impl::KokkosToMKLScalar(beta); + if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_mv(op, alpha_mkl, subhandle->mat, + subhandle->descr, x_mkl, + beta_mkl, y_mkl)); + } else if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_mv(op, alpha_mkl, subhandle->mat, + subhandle->descr, x_mkl, + beta_mkl, y_mkl)); + } else if constexpr (std::is_same_v>) { + 
KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_mv(op, alpha_mkl, subhandle->mat, + subhandle->descr, x_mkl, + beta_mkl, y_mkl)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_mv(op, alpha_mkl, subhandle->mat, + subhandle->descr, x_mkl, + beta_mkl, y_mkl)); + } } -inline void spm_mv_block_impl_mkl( - sparse_operation_t op, Kokkos::complex alpha, - Kokkos::complex beta, MKL_INT m, MKL_INT n, MKL_INT b, - const MKL_INT* Arowptrs, const MKL_INT* Aentries, - const Kokkos::complex* Avalues, const Kokkos::complex* x, - MKL_INT colx, MKL_INT ldx, Kokkos::complex* y, MKL_INT ldy) { - sparse_matrix_t A_mkl; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_create_bsr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, n, b, - const_cast(Arowptrs), const_cast(Arowptrs + 1), - const_cast(Aentries), (MKL_Complex16*)Avalues)); - - matrix_descr A_descr = getDescription(); - MKL_Complex16 alpha_mkl{alpha.real(), alpha.imag()}; - MKL_Complex16 beta_mkl{beta.real(), beta.imag()}; - KOKKOSKERNELS_MKL_SAFE_CALL( - mkl_sparse_z_mm(op, alpha_mkl, A_mkl, A_descr, SPARSE_LAYOUT_ROW_MAJOR, - reinterpret_cast(x), colx, ldx, - beta_mkl, reinterpret_cast(y), ldy)); +// Note: Scalar here is the Kokkos type, not the MKL type +template +inline void spmv_mv_bsr_mkl(Handle* handle, sparse_operation_t op, Scalar alpha, + Scalar beta, MKL_INT m, MKL_INT n, MKL_INT b, + const MKL_INT* Arowptrs, const MKL_INT* Aentries, + const Scalar* Avalues, const Scalar* x, + MKL_INT colx, MKL_INT ldx, Scalar* y, MKL_INT ldy) { + using MKLScalar = + typename KokkosSparse::Impl::KokkosToMKLScalar::type; + using ExecSpace = typename Handle::ExecutionSpaceType; + using Subhandle = KokkosSparse::Impl::MKL_SpMV_Data; + Subhandle* subhandle; + const MKLScalar* x_mkl = reinterpret_cast(x); + MKLScalar* y_mkl = reinterpret_cast(y); + if (handle->is_set_up) { + subhandle = dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not 
set up for MKL BSR"); + } else { + // Use the default execution space instance, as classic MKL does not use + // a specific instance. + subhandle = new Subhandle(ExecSpace()); + handle->tpl = subhandle; + subhandle->descr.type = SPARSE_MATRIX_TYPE_GENERAL; + subhandle->descr.mode = SPARSE_FILL_MODE_FULL; + subhandle->descr.diag = SPARSE_DIAG_NON_UNIT; + // Note: the create_csr routine requires non-const values even though + // they're not actually modified + MKLScalar* Avalues_mkl = + reinterpret_cast(const_cast(Avalues)); + if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_create_bsr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, + n, b, const_cast(Arowptrs), + const_cast(Arowptrs + 1), const_cast(Aentries), + Avalues_mkl)); + } else if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_create_bsr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, + n, b, const_cast(Arowptrs), + const_cast(Arowptrs + 1), const_cast(Aentries), + Avalues_mkl)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_create_bsr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, + n, b, const_cast(Arowptrs), + const_cast(Arowptrs + 1), const_cast(Aentries), + Avalues_mkl)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_create_bsr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, m, + n, b, const_cast(Arowptrs), + const_cast(Arowptrs + 1), const_cast(Aentries), + Avalues_mkl)); + } + handle->is_set_up = true; + } + MKLScalar alpha_mkl = KokkosSparse::Impl::KokkosToMKLScalar(alpha); + MKLScalar beta_mkl = KokkosSparse::Impl::KokkosToMKLScalar(beta); + if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_mm( + op, alpha_mkl, subhandle->mat, subhandle->descr, + SPARSE_LAYOUT_ROW_MAJOR, x_mkl, colx, ldx, beta_mkl, y_mkl, ldy)); + } else if constexpr (std::is_same_v) { + 
KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_mm( + op, alpha_mkl, subhandle->mat, subhandle->descr, + SPARSE_LAYOUT_ROW_MAJOR, x_mkl, colx, ldx, beta_mkl, y_mkl, ldy)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_mm( + op, alpha_mkl, subhandle->mat, subhandle->descr, + SPARSE_LAYOUT_ROW_MAJOR, x_mkl, colx, ldx, beta_mkl, y_mkl, ldy)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_mm( + op, alpha_mkl, subhandle->mat, subhandle->descr, + SPARSE_LAYOUT_ROW_MAJOR, x_mkl, colx, ldx, beta_mkl, y_mkl, ldy)); + } } -#endif - -#define KOKKOSSPARSE_SPMV_MKL(SCALAR, EXECSPACE, COMPILE_LIBRARY) \ - template <> \ - struct SPMV_BSRMATRIX< \ - EXECSPACE, \ - ::KokkosSparse::Experimental::BsrMatrix< \ - SCALAR const, MKL_INT const, \ - Kokkos::Device, \ - Kokkos::MemoryTraits, MKL_INT const>, \ - Kokkos::View< \ - SCALAR const*, Kokkos::LayoutLeft, \ - Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>, \ - true, COMPILE_LIBRARY> { \ - using device_type = Kokkos::Device; \ - using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ - SCALAR const, MKL_INT const, device_type, \ - Kokkos::MemoryTraits, MKL_INT const>; \ - using XVector = Kokkos::View< \ - SCALAR const*, Kokkos::LayoutLeft, device_type, \ - Kokkos::MemoryTraits>; \ - using YVector = Kokkos::View>; \ - using coefficient_type = typename YVector::non_const_value_type; \ - \ - static void spmv_bsrmatrix( \ - const EXECSPACE&, \ - const KokkosKernels::Experimental::Controls& /*controls*/, \ - const char mode[], const coefficient_type& alpha, const AMatrix& A, \ - const XVector& X, const coefficient_type& beta, const YVector& Y) { \ - std::string label = "KokkosSparse::spmv[TPL_MKL,BSRMATRIX" + \ - Kokkos::ArithTraits::name() + "]"; \ - Kokkos::Profiling::pushRegion(label); \ - spmv_block_impl_mkl(mode_kk_to_mkl(mode[0]), alpha, beta, A.numRows(), \ - A.numCols(), A.blockDim(), A.graph.row_map.data(), \ - 
A.graph.entries.data(), A.values.data(), X.data(), \ - Y.data()); \ - Kokkos::Profiling::popRegion(); \ - } \ +#define KOKKOSSPARSE_SPMV_MKL(SCALAR, EXECSPACE, COMPILE_LIBRARY) \ + template <> \ + struct SPMV_BSRMATRIX< \ + EXECSPACE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + ::KokkosSparse::Experimental::BsrMatrix< \ + SCALAR const, MKL_INT const, \ + Kokkos::Device, \ + Kokkos::MemoryTraits, MKL_INT const>, \ + Kokkos::View< \ + SCALAR const*, Kokkos::LayoutLeft, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, COMPILE_LIBRARY> { \ + using device_type = Kokkos::Device; \ + using Handle = \ + KokkosSparse::Impl::SPMVHandleImpl; \ + using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ + SCALAR const, MKL_INT const, device_type, \ + Kokkos::MemoryTraits, MKL_INT const>; \ + using XVector = Kokkos::View< \ + SCALAR const*, Kokkos::LayoutLeft, device_type, \ + Kokkos::MemoryTraits>; \ + using YVector = Kokkos::View>; \ + using coefficient_type = typename YVector::non_const_value_type; \ + \ + static void spmv_bsrmatrix(const EXECSPACE&, Handle* handle, \ + const char mode[], \ + const coefficient_type& alpha, \ + const AMatrix& A, const XVector& X, \ + const coefficient_type& beta, \ + const YVector& Y) { \ + std::string label = "KokkosSparse::spmv[TPL_MKL,BSRMATRIX" + \ + Kokkos::ArithTraits::name() + "]"; \ + Kokkos::Profiling::pushRegion(label); \ + spmv_bsr_mkl(handle, mode_kk_to_mkl(mode[0]), alpha, beta, A.numRows(), \ + A.numCols(), A.blockDim(), A.graph.row_map.data(), \ + A.graph.entries.data(), A.values.data(), X.data(), \ + Y.data()); \ + Kokkos::Profiling::popRegion(); \ + } \ }; #ifdef KOKKOS_ENABLE_SERIAL @@ -268,6 +262,8 @@ KOKKOSSPARSE_SPMV_MKL(Kokkos::complex, Kokkos::OpenMP, template <> \ struct SPMV_MV_BSRMATRIX< \ EXECSPACE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ ::KokkosSparse::Experimental::BsrMatrix< \ SCALAR const, MKL_INT const, \ Kokkos::Device, \ @@ -281,9 +277,12 @@ 
KOKKOSSPARSE_SPMV_MKL(Kokkos::complex, Kokkos::OpenMP, Kokkos::MemoryTraits>, \ true, true, COMPILE_LIBRARY> { \ using device_type = Kokkos::Device; \ - using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ - SCALAR const, MKL_INT const, device_type, \ - Kokkos::MemoryTraits, MKL_INT const>; \ + using Handle = \ + KokkosSparse::Impl::SPMVHandleImpl; \ + using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ + SCALAR const, MKL_INT const, device_type, \ + Kokkos::MemoryTraits, MKL_INT const>; \ using XVector = Kokkos::View< \ SCALAR const**, Kokkos::LayoutLeft, device_type, \ Kokkos::MemoryTraits>; \ @@ -291,21 +290,22 @@ KOKKOSSPARSE_SPMV_MKL(Kokkos::complex, Kokkos::OpenMP, Kokkos::MemoryTraits>; \ using coefficient_type = typename YVector::non_const_value_type; \ \ - static void spmv_mv_bsrmatrix( \ - const EXECSPACE&, \ - const KokkosKernels::Experimental::Controls& /*controls*/, \ - const char mode[], const coefficient_type& alpha, const AMatrix& A, \ - const XVector& X, const coefficient_type& beta, const YVector& Y) { \ + static void spmv_mv_bsrmatrix(const EXECSPACE&, Handle* handle, \ + const char mode[], \ + const coefficient_type& alpha, \ + const AMatrix& A, const XVector& X, \ + const coefficient_type& beta, \ + const YVector& Y) { \ std::string label = "KokkosSparse::spmv[TPL_MKL,BSRMATRIX" + \ Kokkos::ArithTraits::name() + "]"; \ Kokkos::Profiling::pushRegion(label); \ MKL_INT colx = static_cast(X.extent(1)); \ MKL_INT ldx = static_cast(X.stride_1()); \ MKL_INT ldy = static_cast(Y.stride_1()); \ - spm_mv_block_impl_mkl(mode_kk_to_mkl(mode[0]), alpha, beta, A.numRows(), \ - A.numCols(), A.blockDim(), A.graph.row_map.data(), \ - A.graph.entries.data(), A.values.data(), X.data(), \ - colx, ldx, Y.data(), ldy); \ + spmv_mv_bsr_mkl(handle, mode_kk_to_mkl(mode[0]), alpha, beta, \ + A.numRows(), A.numCols(), A.blockDim(), \ + A.graph.row_map.data(), A.graph.entries.data(), \ + A.values.data(), X.data(), colx, ldx, Y.data(), ldy); \ 
Kokkos::Profiling::popRegion(); \ } \ }; @@ -335,15 +335,13 @@ KOKKOSSPARSE_SPMV_MV_MKL(Kokkos::complex, Kokkos::OpenMP, #undef KOKKOSSPARSE_SPMV_MV_MKL } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse -#endif // KOKKOSKERNELS_ENABLE_TPL_MKL +#endif // defined(KOKKOSKERNELS_ENABLE_TPL_MKL) && (__INTEL_MKL__ > 2017) // cuSPARSE #ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE #include "cusparse.h" -#include "KokkosSparse_Utils_cusparse.hpp" // // From https://docs.nvidia.com/cuda/cusparse/index.html#bsrmv @@ -352,25 +350,29 @@ KOKKOSSPARSE_SPMV_MV_MKL(Kokkos::complex, Kokkos::OpenMP, // - Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported // - Only CUSPARSE_MATRIX_TYPE_GENERAL is supported. // +#if (9000 <= CUDA_VERSION) + +#include "KokkosSparse_Utils_cusparse.hpp" + namespace KokkosSparse { -namespace Experimental { namespace Impl { -template -void spmv_block_impl_cusparse( - const Kokkos::Cuda& exec, - const KokkosKernels::Experimental::Controls& controls, const char mode[], - typename YVector::non_const_value_type const& alpha, const AMatrix& A, - const XVector& x, typename YVector::non_const_value_type const& beta, - const YVector& y) { +template +void spmv_bsr_cusparse(const Kokkos::Cuda& exec, Handle* handle, + const char mode[], + typename YVector::non_const_value_type const& alpha, + const AMatrix& A, const XVector& x, + typename YVector::non_const_value_type const& beta, + const YVector& y) { using offset_type = typename AMatrix::non_const_size_type; using entry_type = typename AMatrix::non_const_ordinal_type; using value_type = typename AMatrix::non_const_value_type; /* initialize cusparse library */ - cusparseHandle_t cusparseHandle = controls.getCusparseHandle(); + cusparseHandle_t cusparseHandle = + KokkosKernels::Impl::CusparseSingleton::singleton().cusparseHandle; /* Set cuSPARSE to use the given stream until this function exits */ - KokkosSparse::Impl::TemporarySetCusparseStream(cusparseHandle, exec); + 
KokkosSparse::Impl::TemporarySetCusparseStream tscs(cusparseHandle, exec); /* Set the operation mode */ cusparseOperation_t myCusparseOperation; @@ -382,70 +384,75 @@ void spmv_block_impl_cusparse( } } -#if (9000 <= CUDA_VERSION) + KokkosSparse::Impl::CuSparse9_SpMV_Data* subhandle; + + if (handle->is_set_up) { + subhandle = + dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for cusparse"); + } else { + /* create and set the subhandle and matrix descriptor */ + subhandle = new KokkosSparse::Impl::CuSparse9_SpMV_Data(exec); + handle->tpl = subhandle; + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&subhandle->mat)); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetMatType(subhandle->mat, CUSPARSE_MATRIX_TYPE_GENERAL)); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetMatIndexBase(subhandle->mat, CUSPARSE_INDEX_BASE_ZERO)); + handle->is_set_up = true; + } - /* create and set the matrix descriptor */ - cusparseMatDescr_t descrA = 0; - KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&descrA)); - KOKKOS_CUSPARSE_SAFE_CALL( - cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)); - KOKKOS_CUSPARSE_SAFE_CALL( - cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO)); cusparseDirection_t dirA = CUSPARSE_DIRECTION_ROW; /* perform the actual SpMV operation */ - if ((std::is_same::value) && - (std::is_same::value)) { - if (std::is_same::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseSbsrmv( - cusparseHandle, dirA, myCusparseOperation, A.numRows(), A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), - reinterpret_cast(x.data()), - reinterpret_cast(&beta), - reinterpret_cast(y.data()))); - } else if (std::is_same::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseDbsrmv( - cusparseHandle, dirA, myCusparseOperation, A.numRows(), A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - 
reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), - reinterpret_cast(x.data()), - reinterpret_cast(&beta), - reinterpret_cast(y.data()))); - } else if (std::is_same>::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseCbsrmv( - cusparseHandle, dirA, myCusparseOperation, A.numRows(), A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), - reinterpret_cast(x.data()), - reinterpret_cast(&beta), - reinterpret_cast(y.data()))); - } else if (std::is_same>::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseZbsrmv( - cusparseHandle, dirA, myCusparseOperation, A.numRows(), A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), - reinterpret_cast(x.data()), - reinterpret_cast(&beta), - reinterpret_cast(y.data()))); - } else { - throw std::logic_error( - "Trying to call cusparse[*]bsrmv with a scalar type not " - "float/double, " - "nor complex of either!"); - } + static_assert( + std::is_same_v && std::is_same_v, + "With cuSPARSE non-generic API, offset and entry types must both be int. 
" + "Something wrong with TPL avail logic."); + if constexpr (std::is_same_v) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseSbsrmv( + cusparseHandle, dirA, myCusparseOperation, A.numRows(), A.numCols(), + A.nnz(), reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), A.graph.row_map.data(), + A.graph.entries.data(), A.blockDim(), + reinterpret_cast(x.data()), + reinterpret_cast(&beta), + reinterpret_cast(y.data()))); + } else if constexpr (std::is_same_v) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseDbsrmv( + cusparseHandle, dirA, myCusparseOperation, A.numRows(), A.numCols(), + A.nnz(), reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), + reinterpret_cast(x.data()), + reinterpret_cast(&beta), + reinterpret_cast(y.data()))); + } else if constexpr (std::is_same_v>) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCbsrmv( + cusparseHandle, dirA, myCusparseOperation, A.numRows(), A.numCols(), + A.nnz(), reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), + reinterpret_cast(x.data()), + reinterpret_cast(&beta), + reinterpret_cast(y.data()))); + } else if constexpr (std::is_same_v>) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseZbsrmv( + cusparseHandle, dirA, myCusparseOperation, A.numRows(), A.numCols(), + A.nnz(), reinterpret_cast(&alpha), + subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), + reinterpret_cast(x.data()), + reinterpret_cast(&beta), + reinterpret_cast(y.data()))); } else { - throw std::logic_error( - "With cuSPARSE pre-10.0, offset and entry types must be int. 
" - "Something wrong with TPL avail logic."); + static_assert(KokkosKernels::Impl::always_false_v, + "Trying to call cusparse[*]bsrmv with a scalar type not " + "float/double, nor complex of either!"); } - - KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroyMatDescr(descrA)); -#endif // (9000 <= CUDA_VERSION) } // Reference @@ -463,29 +470,24 @@ void spmv_block_impl_cusparse( // -> C = t(t(B)) * t(A) + C // -> C = B * t(A) + C // This is impossible in cuSparse without explicitly transposing A, -// so we just do not support LayoutRight in cuSparse TPL now -// -template < - class AMatrix, class XVector, class YVector, - std::enable_if_t::value && - std::is_same::value, - bool> = true> -void spm_mv_block_impl_cusparse( - const Kokkos::Cuda& exec, - const KokkosKernels::Experimental::Controls& controls, const char mode[], - typename YVector::non_const_value_type const& alpha, const AMatrix& A, - const XVector& x, typename YVector::non_const_value_type const& beta, - const YVector& y) { +// so we just do not support LayoutRight in cuSparse TPL now (this is +// statically asserted here) +template +void spmv_mv_bsr_cusparse(const Kokkos::Cuda& exec, Handle* handle, + const char mode[], + typename YVector::non_const_value_type const& alpha, + const AMatrix& A, const XVector& x, + typename YVector::non_const_value_type const& beta, + const YVector& y) { using offset_type = typename AMatrix::non_const_size_type; using entry_type = typename AMatrix::non_const_ordinal_type; using value_type = typename AMatrix::non_const_value_type; /* initialize cusparse library */ - cusparseHandle_t cusparseHandle = controls.getCusparseHandle(); + cusparseHandle_t cusparseHandle = + KokkosKernels::Impl::CusparseSingleton::singleton().cusparseHandle; /* Set cuSPARSE to use the given stream until this function exits */ - KokkosSparse::Impl::TemporarySetCusparseStream(cusparseHandle, exec); + KokkosSparse::Impl::TemporarySetCusparseStream tscs(cusparseHandle, exec); /* Set the operation mode */ 
cusparseOperation_t myCusparseOperation; @@ -499,123 +501,136 @@ void spm_mv_block_impl_cusparse( int colx = static_cast(x.extent(1)); - // ldx and ldy should be the leading dimension of X,Y respectively - const int ldx = static_cast(x.extent(0)); - const int ldy = static_cast(y.extent(0)); + // ldx and ldy should be the leading dimension (stride between columns) of X,Y + // respectively + const int ldx = static_cast(x.stride(1)); + const int ldy = static_cast(y.stride(1)); -#if (9000 <= CUDA_VERSION) - - /* create and set the matrix descriptor */ - cusparseMatDescr_t descrA = 0; - KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&descrA)); - KOKKOS_CUSPARSE_SAFE_CALL( - cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)); - KOKKOS_CUSPARSE_SAFE_CALL( - cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO)); + static_assert( + std::is_same_v && + std::is_same_v, + "cuSPARSE requires both X and Y to be LayoutLeft."); + + KokkosSparse::Impl::CuSparse9_SpMV_Data* subhandle; + + if (handle->is_set_up) { + subhandle = + dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for cusparse"); + } else { + /* create and set the subhandle and matrix descriptor */ + subhandle = new KokkosSparse::Impl::CuSparse9_SpMV_Data(exec); + handle->tpl = subhandle; + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&subhandle->mat)); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetMatType(subhandle->mat, CUSPARSE_MATRIX_TYPE_GENERAL)); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetMatIndexBase(subhandle->mat, CUSPARSE_INDEX_BASE_ZERO)); + handle->is_set_up = true; + } cusparseDirection_t dirA = CUSPARSE_DIRECTION_ROW; /* perform the actual SpMV operation */ - if ((std::is_same::value) && - (std::is_same::value)) { - if (std::is_same::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseSbsrmm( - cusparseHandle, dirA, myCusparseOperation, - CUSPARSE_OPERATION_NON_TRANSPOSE, A.numRows(), colx, A.numCols(), - A.nnz(), 
reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), - reinterpret_cast(x.data()), ldx, - reinterpret_cast(&beta), - reinterpret_cast(y.data()), ldy)); - } else if (std::is_same::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseDbsrmm( - cusparseHandle, dirA, myCusparseOperation, - CUSPARSE_OPERATION_NON_TRANSPOSE, A.numRows(), colx, A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), - reinterpret_cast(x.data()), ldx, - reinterpret_cast(&beta), - reinterpret_cast(y.data()), ldy)); - } else if (std::is_same>::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseCbsrmm( - cusparseHandle, dirA, myCusparseOperation, - CUSPARSE_OPERATION_NON_TRANSPOSE, A.numRows(), colx, A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), - reinterpret_cast(x.data()), ldx, - reinterpret_cast(&beta), - reinterpret_cast(y.data()), ldy)); - } else if (std::is_same>::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseZbsrmm( - cusparseHandle, dirA, myCusparseOperation, - CUSPARSE_OPERATION_NON_TRANSPOSE, A.numRows(), colx, A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), - reinterpret_cast(x.data()), ldx, - reinterpret_cast(&beta), - reinterpret_cast(y.data()), ldy)); - } else { - throw std::logic_error( - "Trying to call cusparse[*]bsrmm with a scalar type not " - "float/double, " - "nor complex of either!"); - } + static_assert( + std::is_same_v && std::is_same_v, + "With cuSPARSE non-generic API, offset and entry types must both be int. 
" + "Something wrong with TPL avail logic."); + if constexpr (std::is_same_v) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseSbsrmm( + cusparseHandle, dirA, myCusparseOperation, + CUSPARSE_OPERATION_NON_TRANSPOSE, A.numRows(), colx, A.numCols(), + A.nnz(), reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), A.graph.row_map.data(), + A.graph.entries.data(), A.blockDim(), + reinterpret_cast(x.data()), ldx, + reinterpret_cast(&beta), + reinterpret_cast(y.data()), ldy)); + } else if constexpr (std::is_same_v) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseDbsrmm( + cusparseHandle, dirA, myCusparseOperation, + CUSPARSE_OPERATION_NON_TRANSPOSE, A.numRows(), colx, A.numCols(), + A.nnz(), reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), + reinterpret_cast(x.data()), ldx, + reinterpret_cast(&beta), + reinterpret_cast(y.data()), ldy)); + } else if constexpr (std::is_same_v>) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCbsrmm( + cusparseHandle, dirA, myCusparseOperation, + CUSPARSE_OPERATION_NON_TRANSPOSE, A.numRows(), colx, A.numCols(), + A.nnz(), reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), + reinterpret_cast(x.data()), ldx, + reinterpret_cast(&beta), + reinterpret_cast(y.data()), ldy)); + } else if constexpr (std::is_same_v>) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseZbsrmm( + cusparseHandle, dirA, myCusparseOperation, + CUSPARSE_OPERATION_NON_TRANSPOSE, A.numRows(), colx, A.numCols(), + A.nnz(), reinterpret_cast(&alpha), + subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), A.blockDim(), + reinterpret_cast(x.data()), ldx, + reinterpret_cast(&beta), + reinterpret_cast(y.data()), ldy)); } else { - throw std::logic_error( - "With cuSPARSE pre-10.0, offset and entry types must be int. 
" - "Something wrong with TPL avail logic."); + static_assert(KokkosKernels::Impl::always_false_v, + "Trying to call cusparse[*]bsrmm with a scalar type not " + "float/double, nor complex of either!"); } - - KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroyMatDescr(descrA)); -#endif // (9000 <= CUDA_VERSION) } -#define KOKKOSSPARSE_SPMV_CUSPARSE(SCALAR, ORDINAL, OFFSET, LAYOUT, SPACE, \ - COMPILE_LIBRARY) \ - template <> \ - struct SPMV_BSRMATRIX< \ - Kokkos::Cuda, \ - ::KokkosSparse::Experimental::BsrMatrix< \ - SCALAR const, ORDINAL const, Kokkos::Device, \ - Kokkos::MemoryTraits, OFFSET const>, \ - Kokkos::View< \ - SCALAR const*, LAYOUT, Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>, \ - true, COMPILE_LIBRARY> { \ - using device_type = Kokkos::Device; \ - using memory_trait_type = Kokkos::MemoryTraits; \ - using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ - SCALAR const, ORDINAL const, device_type, memory_trait_type, \ - OFFSET const>; \ - using XVector = Kokkos::View< \ - SCALAR const*, LAYOUT, device_type, \ - Kokkos::MemoryTraits>; \ - using YVector = \ - Kokkos::View; \ - using Controls = KokkosKernels::Experimental::Controls; \ - \ - using coefficient_type = typename YVector::non_const_value_type; \ - \ - static void spmv_bsrmatrix(const Kokkos::Cuda& exec, \ - const Controls& controls, const char mode[], \ - const coefficient_type& alpha, \ - const AMatrix& A, const XVector& x, \ - const coefficient_type& beta, \ - const YVector& y) { \ - std::string label = "KokkosSparse::spmv[TPL_CUSPARSE,BSRMATRIX" + \ - Kokkos::ArithTraits::name() + "]"; \ - Kokkos::Profiling::pushRegion(label); \ - spmv_block_impl_cusparse(exec, controls, mode, alpha, A, x, beta, y); \ - Kokkos::Profiling::popRegion(); \ - } \ +#define KOKKOSSPARSE_SPMV_CUSPARSE(SCALAR, ORDINAL, OFFSET, LAYOUT, SPACE, \ + COMPILE_LIBRARY) \ + template <> \ + struct SPMV_BSRMATRIX< \ + Kokkos::Cuda, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + 
::KokkosSparse::Experimental::BsrMatrix< \ + SCALAR const, ORDINAL const, Kokkos::Device, \ + Kokkos::MemoryTraits, OFFSET const>, \ + Kokkos::View< \ + SCALAR const*, LAYOUT, Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, COMPILE_LIBRARY> { \ + using device_type = Kokkos::Device; \ + using memory_trait_type = Kokkos::MemoryTraits; \ + using Handle = \ + KokkosSparse::Impl::SPMVHandleImpl; \ + using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ + SCALAR const, ORDINAL const, device_type, memory_trait_type, \ + OFFSET const>; \ + using XVector = Kokkos::View< \ + SCALAR const*, LAYOUT, device_type, \ + Kokkos::MemoryTraits>; \ + using YVector = \ + Kokkos::View; \ + \ + using coefficient_type = typename YVector::non_const_value_type; \ + \ + static void spmv_bsrmatrix(const Kokkos::Cuda& exec, Handle* handle, \ + const char mode[], \ + const coefficient_type& alpha, \ + const AMatrix& A, const XVector& x, \ + const coefficient_type& beta, \ + const YVector& y) { \ + std::string label = "KokkosSparse::spmv[TPL_CUSPARSE,BSRMATRIX" + \ + Kokkos::ArithTraits::name() + "]"; \ + Kokkos::Profiling::pushRegion(label); \ + spmv_bsr_cusparse(exec, handle, mode, alpha, A, x, beta, y); \ + Kokkos::Profiling::popRegion(); \ + } \ }; -#if (9000 <= CUDA_VERSION) KOKKOSSPARSE_SPMV_CUSPARSE(double, int, int, Kokkos::LayoutLeft, Kokkos::CudaSpace, KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) @@ -664,57 +679,59 @@ KOKKOSSPARSE_SPMV_CUSPARSE(Kokkos::complex, int, int, Kokkos::LayoutLeft, KOKKOSSPARSE_SPMV_CUSPARSE(Kokkos::complex, int, int, Kokkos::LayoutRight, Kokkos::CudaUVMSpace, KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) -#endif // (9000 <= CUDA_VERSION) #undef KOKKOSSPARSE_SPMV_CUSPARSE // cuSparse TPL does not support LayoutRight for this operation // only specialize for LayoutLeft -#define KOKKOSSPARSE_SPMV_MV_CUSPARSE(SCALAR, ORDINAL, OFFSET, SPACE, \ - ETI_AVAIL) \ - template <> \ - struct SPMV_MV_BSRMATRIX< \ - Kokkos::Cuda, \ - 
::KokkosSparse::Experimental::BsrMatrix< \ - SCALAR const, ORDINAL const, Kokkos::Device, \ - Kokkos::MemoryTraits, OFFSET const>, \ - Kokkos::View< \ - SCALAR const**, Kokkos::LayoutLeft, \ - Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>, \ - false, true, ETI_AVAIL> { \ - using device_type = Kokkos::Device; \ - using memory_trait_type = Kokkos::MemoryTraits; \ - using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ - SCALAR const, ORDINAL const, device_type, memory_trait_type, \ - OFFSET const>; \ - using XVector = Kokkos::View< \ - SCALAR const**, Kokkos::LayoutLeft, device_type, \ - Kokkos::MemoryTraits>; \ - using YVector = Kokkos::View \ + struct SPMV_MV_BSRMATRIX< \ + Kokkos::Cuda, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + ::KokkosSparse::Experimental::BsrMatrix< \ + SCALAR const, ORDINAL const, Kokkos::Device, \ + Kokkos::MemoryTraits, OFFSET const>, \ + Kokkos::View< \ + SCALAR const**, Kokkos::LayoutLeft, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + false, true, ETI_AVAIL> { \ + using device_type = Kokkos::Device; \ + using memory_trait_type = Kokkos::MemoryTraits; \ + using Handle = \ + KokkosSparse::Impl::SPMVHandleImpl; \ + using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ + SCALAR const, ORDINAL const, device_type, memory_trait_type, \ + OFFSET const>; \ + using XVector = Kokkos::View< \ + SCALAR const**, Kokkos::LayoutLeft, device_type, \ + Kokkos::MemoryTraits>; \ + using YVector = Kokkos::View; \ - using Controls = KokkosKernels::Experimental::Controls; \ - \ - using coefficient_type = typename YVector::non_const_value_type; \ - \ - static void spmv_mv_bsrmatrix(const Kokkos::Cuda& exec, \ - const Controls& controls, const char mode[], \ - const coefficient_type& alpha, \ - const AMatrix& A, const XVector& x, \ - const coefficient_type& beta, \ - const YVector& y) { \ - std::string label = "KokkosSparse::spmv[TPL_CUSPARSE,BSRMATRIX" + \ - 
Kokkos::ArithTraits::name() + "]"; \ - Kokkos::Profiling::pushRegion(label); \ - spm_mv_block_impl_cusparse(exec, controls, mode, alpha, A, x, beta, y); \ - Kokkos::Profiling::popRegion(); \ - } \ + \ + using coefficient_type = typename YVector::non_const_value_type; \ + \ + static void spmv_mv_bsrmatrix(const Kokkos::Cuda& exec, Handle* handle, \ + const char mode[], \ + const coefficient_type& alpha, \ + const AMatrix& A, const XVector& x, \ + const coefficient_type& beta, \ + const YVector& y) { \ + std::string label = "KokkosSparse::spmv[TPL_CUSPARSE,BSRMATRIX" + \ + Kokkos::ArithTraits::name() + "]"; \ + Kokkos::Profiling::pushRegion(label); \ + spmv_mv_bsr_cusparse(exec, handle, mode, alpha, A, x, beta, y); \ + Kokkos::Profiling::popRegion(); \ + } \ }; -#if (9000 <= CUDA_VERSION) KOKKOSSPARSE_SPMV_MV_CUSPARSE(double, int, int, Kokkos::CudaSpace, true) KOKKOSSPARSE_SPMV_MV_CUSPARSE(double, int, int, Kokkos::CudaSpace, false) KOKKOSSPARSE_SPMV_MV_CUSPARSE(float, int, int, Kokkos::CudaSpace, true) @@ -740,13 +757,11 @@ KOKKOSSPARSE_SPMV_MV_CUSPARSE(Kokkos::complex, int, int, KOKKOSSPARSE_SPMV_MV_CUSPARSE(Kokkos::complex, int, int, Kokkos::CudaUVMSpace, false) -#endif // (9000 <= CUDA_VERSION) - #undef KOKKOSSPARSE_SPMV_MV_CUSPARSE } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse +#endif // (9000 <= CUDA_VERSION) #endif // KOKKOSKERNELS_ENABLE_TPL_CUSPARSE @@ -760,16 +775,15 @@ KOKKOSSPARSE_SPMV_MV_CUSPARSE(Kokkos::complex, int, int, #include "KokkosSparse_Utils_rocsparse.hpp" namespace KokkosSparse { -namespace Experimental { namespace Impl { -template -void spmv_block_impl_rocsparse( - const Kokkos::HIP& exec, - const KokkosKernels::Experimental::Controls& controls, const char mode[], - typename YVector::non_const_value_type const& alpha, const AMatrix& A, - const XVector& x, typename YVector::non_const_value_type const& beta, - const YVector& y) { +template +void spmv_bsr_rocsparse(const Kokkos::HIP& exec, Handle* handle, + const 
char mode[], + typename YVector::non_const_value_type const& alpha, + const AMatrix& A, const XVector& x, + typename YVector::non_const_value_type const& beta, + const YVector& y) { /* rocm 5.4.0 rocsparse_*bsrmv reference: https://rocsparse.readthedocs.io/en/rocm-5.4.0/usermanual.html#rocsparse-bsrmv-ex @@ -818,9 +832,10 @@ void spmv_block_impl_rocsparse( Kokkos::LayoutStride>, "A entries must be contiguous"); - rocsparse_handle handle = controls.getRocsparseHandle(); + rocsparse_handle rocsparseHandle = + KokkosKernels::Impl::RocsparseSingleton::singleton().rocsparseHandle; // resets handle stream to NULL when out of scope - KokkosSparse::Impl::TemporarySetRocsparseStream tsrs(handle, exec); + KokkosSparse::Impl::TemporarySetRocsparseStream tsrs(rocsparseHandle, exec); // set the mode rocsparse_operation trans; @@ -864,45 +879,94 @@ void spmv_block_impl_rocsparse( reinterpret_cast(&beta); rocsparse_value_type* y_ = reinterpret_cast(y.data()); - rocsparse_mat_descr descr; - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_create_mat_descr(&descr)); - rocsparse_mat_info info; - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_create_mat_info(&info)); + KokkosSparse::Impl::RocSparse_BSR_SpMV_Data* subhandle; + if (handle->is_set_up) { + subhandle = + dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for rocsparse BSR"); + } else { + subhandle = new KokkosSparse::Impl::RocSparse_BSR_SpMV_Data(exec); + handle->tpl = subhandle; + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_create_mat_descr(&subhandle->mat)); + // *_ex* functions deprecated in introduced in 6+ +#if KOKKOSSPARSE_IMPL_ROCM_VERSION >= 60000 + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_create_mat_info(&subhandle->info)); + if constexpr (std::is_same_v) { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_sbsrmv_analysis( + rocsparseHandle, dir, trans, mb, nb, nnzb, subhandle->mat, bsr_val, + bsr_row_ptr, bsr_col_ind, block_dim, subhandle->info)); + } else 
if constexpr (std::is_same_v) { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_dbsrmv_analysis( + rocsparseHandle, dir, trans, mb, nb, nnzb, subhandle->mat, bsr_val, + bsr_row_ptr, bsr_col_ind, block_dim, subhandle->info)); + } else if constexpr (std::is_same_v>) { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_cbsrmv_analysis( + rocsparseHandle, dir, trans, mb, nb, nnzb, subhandle->mat, bsr_val, + bsr_row_ptr, bsr_col_ind, block_dim, subhandle->info)); + } else if constexpr (std::is_same_v>) { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_zbsrmv_analysis( + rocsparseHandle, dir, trans, mb, nb, nnzb, subhandle->mat, bsr_val, + bsr_row_ptr, bsr_col_ind, block_dim, subhandle->info)); + } else { + static_assert(KokkosKernels::Impl::always_false_v, + "unsupported value type for rocsparse_*bsrmv"); + } + // *_ex* functions introduced in 5.4.0 +#elif KOKKOSSPARSE_IMPL_ROCM_VERSION < 50400 + // No analysis step in the older versions +#else + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_create_mat_info(&subhandle->info)); + if constexpr (std::is_same_v) { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_sbsrmv_ex_analysis( + rocsparseHandle, dir, trans, mb, nb, nnzb, subhandle->mat, bsr_val, + bsr_row_ptr, bsr_col_ind, block_dim, subhandle->info)); + } else if constexpr (std::is_same_v) { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_dbsrmv_ex_analysis( + rocsparseHandle, dir, trans, mb, nb, nnzb, subhandle->mat, bsr_val, + bsr_row_ptr, bsr_col_ind, block_dim, subhandle->info)); + } else if constexpr (std::is_same_v>) { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_cbsrmv_ex_analysis( + rocsparseHandle, dir, trans, mb, nb, nnzb, subhandle->mat, bsr_val, + bsr_row_ptr, bsr_col_ind, block_dim, subhandle->info)); + } else if constexpr (std::is_same_v>) { + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_zbsrmv_ex_analysis( + rocsparseHandle, dir, trans, mb, nb, nnzb, subhandle->mat, bsr_val, + bsr_row_ptr, bsr_col_ind, block_dim, subhandle->info)); + } else { + 
static_assert(KokkosKernels::Impl::always_false_v, + "unsupported value type for rocsparse_*bsrmv"); + } +#endif + handle->is_set_up = true; + } // *_ex* functions deprecated in introduced in 6+ #if KOKKOSSPARSE_IMPL_ROCM_VERSION >= 60000 if constexpr (std::is_same_v) { - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_sbsrmv_analysis( - handle, dir, trans, mb, nb, nnzb, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_sbsrmv( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info, x_, beta_, y_)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_bsrsv_clear(handle, info)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_sbsrmv(rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, + subhandle->mat, bsr_val, bsr_row_ptr, bsr_col_ind, + block_dim, subhandle->info, x_, beta_, y_)); } else if constexpr (std::is_same_v) { - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_dbsrmv_analysis( - handle, dir, trans, mb, nb, nnzb, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_dbsrmv( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info, x_, beta_, y_)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_bsrsv_clear(handle, info)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_dbsrmv(rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, + subhandle->mat, bsr_val, bsr_row_ptr, bsr_col_ind, + block_dim, subhandle->info, x_, beta_, y_)); } else if constexpr (std::is_same_v>) { - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_cbsrmv_analysis( - handle, dir, trans, mb, nb, nnzb, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_cbsrmv( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info, x_, beta_, y_)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_bsrsv_clear(handle, info)); + 
KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_cbsrmv(rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, + subhandle->mat, bsr_val, bsr_row_ptr, bsr_col_ind, + block_dim, subhandle->info, x_, beta_, y_)); } else if constexpr (std::is_same_v>) { - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_zbsrmv_analysis( - handle, dir, trans, mb, nb, nnzb, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_zbsrmv( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info, x_, beta_, y_)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_bsrsv_clear(handle, info)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_zbsrmv(rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, + subhandle->mat, bsr_val, bsr_row_ptr, bsr_col_ind, + block_dim, subhandle->info, x_, beta_, y_)); } else { static_assert(KokkosKernels::Impl::always_false_v, "unsupported value type for rocsparse_*bsrmv"); @@ -911,72 +975,59 @@ void spmv_block_impl_rocsparse( #elif KOKKOSSPARSE_IMPL_ROCM_VERSION < 50400 if constexpr (std::is_same_v) { KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_sbsrmv( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, x_, beta_, y_)); + rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, subhandle->mat, + bsr_val, bsr_row_ptr, bsr_col_ind, block_dim, x_, beta_, y_)); } else if constexpr (std::is_same_v) { KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_dbsrmv( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, x_, beta_, y_)); + rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, subhandle->mat, + bsr_val, bsr_row_ptr, bsr_col_ind, block_dim, x_, beta_, y_)); } else if constexpr (std::is_same_v>) { KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_cbsrmv( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, x_, beta_, y_)); + rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, 
subhandle->mat, + bsr_val, bsr_row_ptr, bsr_col_ind, block_dim, x_, beta_, y_)); } else if constexpr (std::is_same_v>) { KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_zbsrmv( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, x_, beta_, y_)); + rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, subhandle->mat, + bsr_val, bsr_row_ptr, bsr_col_ind, block_dim, x_, beta_, y_)); } else { static_assert(KokkosKernels::Impl::always_false_v, "unsupported value type for rocsparse_*bsrmv"); } #else if constexpr (std::is_same_v) { - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_sbsrmv_ex_analysis( - handle, dir, trans, mb, nb, nnzb, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_sbsrmv_ex( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info, x_, beta_, y_)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_bsrsv_clear(handle, info)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_sbsrmv_ex(rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, + subhandle->mat, bsr_val, bsr_row_ptr, bsr_col_ind, + block_dim, subhandle->info, x_, beta_, y_)); } else if constexpr (std::is_same_v) { - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_dbsrmv_ex_analysis( - handle, dir, trans, mb, nb, nnzb, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_dbsrmv_ex( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info, x_, beta_, y_)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_bsrsv_clear(handle, info)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_dbsrmv_ex(rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, + subhandle->mat, bsr_val, bsr_row_ptr, bsr_col_ind, + block_dim, subhandle->info, x_, beta_, y_)); } else if constexpr (std::is_same_v>) { - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_cbsrmv_ex_analysis( - handle, dir, trans, mb, nb, nnzb, descr, 
bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_cbsrmv_ex( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info, x_, beta_, y_)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_bsrsv_clear(handle, info)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_cbsrmv_ex(rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, + subhandle->mat, bsr_val, bsr_row_ptr, bsr_col_ind, + block_dim, subhandle->info, x_, beta_, y_)); } else if constexpr (std::is_same_v>) { - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_zbsrmv_ex_analysis( - handle, dir, trans, mb, nb, nnzb, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_zbsrmv_ex( - handle, dir, trans, mb, nb, nnzb, alpha_, descr, bsr_val, bsr_row_ptr, - bsr_col_ind, block_dim, info, x_, beta_, y_)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_bsrsv_clear(handle, info)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( + rocsparse_zbsrmv_ex(rocsparseHandle, dir, trans, mb, nb, nnzb, alpha_, + subhandle->mat, bsr_val, bsr_row_ptr, bsr_col_ind, + block_dim, subhandle->info, x_, beta_, y_)); } else { static_assert(KokkosKernels::Impl::always_false_v, "unsupported value type for rocsparse_*bsrmv"); } #endif - rocsparse_destroy_mat_descr(descr); - rocsparse_destroy_mat_info(info); - -} // spmv_block_impl_rocsparse +} // spmv_bsr_rocsparse #define KOKKOSSPARSE_SPMV_ROCSPARSE(SCALAR, ORDINAL, OFFSET, LAYOUT, SPACE, \ COMPILE_LIBRARY) \ template <> \ struct SPMV_BSRMATRIX< \ Kokkos::HIP, \ + KokkosSparse::Impl::SPMVHandleImpl, \ ::KokkosSparse::Experimental::BsrMatrix< \ SCALAR const, ORDINAL const, Kokkos::Device, \ Kokkos::MemoryTraits, OFFSET const>, \ @@ -988,20 +1039,22 @@ void spmv_block_impl_rocsparse( true, COMPILE_LIBRARY> { \ using device_type = Kokkos::Device; \ using memory_trait_type = Kokkos::MemoryTraits; \ - using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ - SCALAR const, ORDINAL 
const, device_type, memory_trait_type, \ - OFFSET const>; \ + using Handle = \ + KokkosSparse::Impl::SPMVHandleImpl; \ + using AMatrix = ::KokkosSparse::Experimental::BsrMatrix< \ + SCALAR const, ORDINAL const, device_type, memory_trait_type, \ + OFFSET const>; \ using XVector = Kokkos::View< \ SCALAR const*, LAYOUT, device_type, \ Kokkos::MemoryTraits>; \ using YVector = \ Kokkos::View; \ - using Controls = KokkosKernels::Experimental::Controls; \ \ using coefficient_type = typename YVector::non_const_value_type; \ \ - static void spmv_bsrmatrix(const Kokkos::HIP& exec, \ - const Controls& controls, const char mode[], \ + static void spmv_bsrmatrix(const Kokkos::HIP& exec, Handle* handle, \ + const char mode[], \ const coefficient_type& alpha, \ const AMatrix& A, const XVector& x, \ const coefficient_type& beta, \ @@ -1009,7 +1062,7 @@ void spmv_block_impl_rocsparse( std::string label = "KokkosSparse::spmv[TPL_ROCSPARSE,BSRMATRIX" + \ Kokkos::ArithTraits::name() + "]"; \ Kokkos::Profiling::pushRegion(label); \ - spmv_block_impl_rocsparse(exec, controls, mode, alpha, A, x, beta, y); \ + spmv_bsr_rocsparse(exec, handle, mode, alpha, A, x, beta, y); \ Kokkos::Profiling::popRegion(); \ } \ }; @@ -1044,7 +1097,6 @@ KOKKOSSPARSE_SPMV_ROCSPARSE(Kokkos::complex, rocsparse_int, #undef KOKKOSSPARSE_SPMV_ROCSPARSE } // namespace Impl -} // namespace Experimental } // namespace KokkosSparse #endif // defined(KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE) diff --git a/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_avail.hpp b/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_avail.hpp index 5e33df1fa3..88fef4421a 100644 --- a/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_avail.hpp +++ b/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_avail.hpp @@ -21,7 +21,8 @@ namespace KokkosSparse { namespace Impl { // Specialization struct which defines whether a specialization exists -template > struct spmv_mv_tpl_spec_avail { @@ -33,6 +34,8 @@ struct spmv_mv_tpl_spec_avail { template <> \ struct spmv_mv_tpl_spec_avail< \ 
Kokkos::Cuda, \ + KokkosSparse::Impl::SPMVHandleImpl, \ KokkosSparse::CrsMatrix< \ const SCALAR, const ORDINAL, Kokkos::Device, \ Kokkos::MemoryTraits, const OFFSET>, \ @@ -48,7 +51,14 @@ struct spmv_mv_tpl_spec_avail { non-transpose that produces incorrect result. This is cusparse distributed with CUDA 10.1.243. The bug seems to be resolved by CUSPARSE 10301 (present by CUDA 10.2.89) */ -#if defined(CUSPARSE_VERSION) && (10301 <= CUSPARSE_VERSION) + +/* cusparseSpMM also produces incorrect results for some inputs in CUDA 11.6.1. + * (CUSPARSE_VERSION 11702). + * ALG1 and ALG3 produce completely incorrect results for one set of inputs. + * ALG2 works for that case, but has low numerical accuracy in another case. + */ +#if defined(CUSPARSE_VERSION) && (10301 <= CUSPARSE_VERSION) && \ + (CUSPARSE_VERSION != 11702) KOKKOSSPARSE_SPMV_MV_TPL_SPEC_AVAIL_CUSPARSE(double, int, int, Kokkos::LayoutLeft, Kokkos::LayoutLeft, diff --git a/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_decl.hpp b/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_decl.hpp index 30e0b6e243..47b7d47f8e 100644 --- a/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_decl.hpp +++ b/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_decl.hpp @@ -18,15 +18,18 @@ #define KOKKOSPARSE_SPMV_MV_TPL_SPEC_DECL_HPP_ #include - -#include "KokkosKernels_Controls.hpp" +#include "KokkosKernels_tpl_handles_decl.hpp" #ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE /* CUSPARSE_VERSION < 10301 either doesn't have cusparseSpMM or the non-tranpose version produces incorrect results. + + Version 11702 corresponds to CUDA 11.6.1, which also produces incorrect + results. 11701 (CUDA 11.6.0) is OK. 
*/ -#if defined(CUSPARSE_VERSION) && (10301 <= CUSPARSE_VERSION) +#if defined(CUSPARSE_VERSION) && (10301 <= CUSPARSE_VERSION) && \ + (CUSPARSE_VERSION != 11702) #include "cusparse.h" #include "KokkosSparse_Utils_cusparse.hpp" @@ -64,9 +67,14 @@ inline cudaDataType compute_type() { */ template = true> cusparseDnMatDescr_t make_cusparse_dn_mat_descr_t(ViewType &view) { - const int64_t rows = view.extent(0); - const int64_t cols = view.extent(1); - const int64_t ld = view.extent(0); + // If the view is LayoutRight, we still need to create descr as column-major + // but it should be an implicit transpose, meaning dimensions and strides are + // swapped + bool transpose = + std::is_same_v; + const size_t rows = transpose ? view.extent(1) : view.extent(0); + const size_t cols = transpose ? view.extent(0) : view.extent(1); + const size_t ld = transpose ? view.stride(0) : view.stride(1); // cusparseCreateCsr notes it is safe to const_cast this away for input // pointers to a descriptor as long as that descriptor is not an output @@ -84,15 +92,15 @@ cusparseDnMatDescr_t make_cusparse_dn_mat_descr_t(ViewType &view) { const cusparseOrder_t order = CUSPARSE_ORDER_COL; cusparseDnMatDescr_t descr; - KOKKOS_CUSPARSE_SAFE_CALL( - cusparseCreateDnMat(&descr, rows, cols, ld, values, valueType, order)); + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateDnMat( + &descr, static_cast(rows), static_cast(cols), + static_cast(ld), values, valueType, order)); return descr; } -template -void spmv_mv_cusparse(const Kokkos::Cuda &exec, - const KokkosKernels::Experimental::Controls &controls, +template +void spmv_mv_cusparse(const Kokkos::Cuda &exec, Handle *handle, const char mode[], typename YVector::non_const_value_type const &alpha, const AMatrix &A, const XVector &x, @@ -110,9 +118,17 @@ void spmv_mv_cusparse(const Kokkos::Cuda &exec, using y_value_type = typename YVector::non_const_value_type; /* initialize cusparse library */ - cusparseHandle_t cusparseHandle = controls.getCusparseHandle(); + 
cusparseHandle_t cusparseHandle = + KokkosKernels::Impl::CusparseSingleton::singleton().cusparseHandle; /* Set cuSPARSE to use the given stream until this function exits */ - TemporarySetCusparseStream(cusparseHandle, exec); + TemporarySetCusparseStream tscs(cusparseHandle, exec); + + /* Check that cusparse can handle the types of the input Kokkos::CrsMatrix */ + const cusparseIndexType_t myCusparseOffsetType = + cusparse_index_type_t_from(); + const cusparseIndexType_t myCusparseEntryType = + cusparse_index_type_t_from(); + const cudaDataType aCusparseType = cuda_data_type_from(); /* Set the operation mode */ cusparseOperation_t opA; @@ -127,21 +143,6 @@ void spmv_mv_cusparse(const Kokkos::Cuda &exec, } } - /* Check that cusparse can handle the types of the input Kokkos::CrsMatrix */ - const cusparseIndexType_t myCusparseOffsetType = - cusparse_index_type_t_from(); - const cusparseIndexType_t myCusparseEntryType = - cusparse_index_type_t_from(); - const cudaDataType aCusparseType = cuda_data_type_from(); - - /* create matrix */ - cusparseSpMatDescr_t A_cusparse; - KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateCsr( - &A_cusparse, A.numRows(), A.numCols(), A.nnz(), - (void *)A.graph.row_map.data(), (void *)A.graph.entries.data(), - (void *)A.values.data(), myCusparseOffsetType, myCusparseEntryType, - CUSPARSE_INDEX_BASE_ZERO, aCusparseType)); - /* create lhs and rhs NOTE: The descriptions always say vecX and vecY are column-major cusparse order. For CUSPARSE_VERSION 10301 this is the only supported ordering. if X @@ -152,16 +153,20 @@ void spmv_mv_cusparse(const Kokkos::Cuda &exec, constexpr bool xIsLR = std::is_same::value; static_assert(xIsLL || xIsLR, "X multivector was not LL or LR (TPL error)"); + static_assert( + std::is_same_v, + "Y multivector was not LL (TPL error)"); cusparseDnMatDescr_t vecX = make_cusparse_dn_mat_descr_t(x); cusparseDnMatDescr_t vecY = make_cusparse_dn_mat_descr_t(y); cusparseOperation_t opB = xIsLL ? 
CUSPARSE_OPERATION_NON_TRANSPOSE : CUSPARSE_OPERATION_TRANSPOSE; -// CUSPARSE_MM_ALG_DEFAULT was deprecated as early as 11.1 (maybe earlier) -#if CUSPARSE_VERSION < 11010 - const cusparseSpMMAlg_t alg = CUSPARSE_MM_ALG_DEFAULT; +// CUSPARSE_MM_ALG_DEFAULT was deprecated in CUDA 11.0.1 / cuSPARSE 11.0.0 and +// removed in CUDA 12.0.0 / cuSPARSE 12.0.0 +#if CUSPARSE_VERSION < 11000 + cusparseSpMMAlg_t algo = CUSPARSE_MM_ALG_DEFAULT; #else - const cusparseSpMMAlg_t alg = CUSPARSE_SPMM_ALG_DEFAULT; + cusparseSpMMAlg_t algo = CUSPARSE_SPMM_ALG_DEFAULT; #endif // the precision of the SpMV @@ -180,21 +185,39 @@ void spmv_mv_cusparse(const Kokkos::Cuda &exec, } } - size_t bufferSize = 0; - KOKKOS_CUSPARSE_SAFE_CALL(cusparseSpMM_bufferSize( - cusparseHandle, opA, opB, &alpha, A_cusparse, vecX, &beta, vecY, - computeType, alg, &bufferSize)); + KokkosSparse::Impl::CuSparse10_SpMV_Data *subhandle; + if (handle->is_set_up) { + subhandle = + dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for cusparse"); + } else { + subhandle = new KokkosSparse::Impl::CuSparse10_SpMV_Data(exec); + handle->tpl = subhandle; + /* create matrix */ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateCsr( + &subhandle->mat, A.numRows(), A.numCols(), A.nnz(), + (void *)A.graph.row_map.data(), (void *)A.graph.entries.data(), + (void *)A.values.data(), myCusparseOffsetType, myCusparseEntryType, + CUSPARSE_INDEX_BASE_ZERO, aCusparseType)); + + KOKKOS_CUSPARSE_SAFE_CALL(cusparseSpMM_bufferSize( + cusparseHandle, opA, opB, &alpha, subhandle->mat, vecX, &beta, vecY, + computeType, algo, &subhandle->bufferSize)); + + KOKKOS_IMPL_CUDA_SAFE_CALL( + cudaMalloc(&subhandle->buffer, subhandle->bufferSize)); + + handle->is_set_up = true; + } - void *dBuffer = nullptr; - KOKKOS_IMPL_CUDA_SAFE_CALL(cudaMalloc(&dBuffer, bufferSize)); KOKKOS_CUSPARSE_SAFE_CALL(cusparseSpMM(cusparseHandle, opA, opB, &alpha, - A_cusparse, vecX, &beta, vecY, - computeType, alg, 
dBuffer)); + subhandle->mat, vecX, &beta, vecY, + computeType, algo, subhandle->buffer)); - KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFree(dBuffer)); KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroyDnMat(vecX)); KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroyDnMat(vecY)); - KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroySpMat(A_cusparse)); } #define KOKKOSSPARSE_SPMV_MV_CUSPARSE(SCALAR, ORDINAL, OFFSET, XL, YL, SPACE, \ @@ -202,6 +225,8 @@ void spmv_mv_cusparse(const Kokkos::Cuda &exec, template <> \ struct SPMV_MV< \ Kokkos::Cuda, \ + KokkosSparse::Impl::SPMVHandleImpl, \ KokkosSparse::CrsMatrix< \ SCALAR const, ORDINAL const, Kokkos::Device, \ Kokkos::MemoryTraits, OFFSET const>, \ @@ -213,6 +238,9 @@ void spmv_mv_cusparse(const Kokkos::Cuda &exec, false, true, COMPILE_LIBRARY> { \ using device_type = Kokkos::Device; \ using memory_trait_type = Kokkos::MemoryTraits; \ + using Handle = \ + KokkosSparse::Impl::SPMVHandleImpl; \ using AMatrix = CrsMatrix; \ using XVector = Kokkos::View< \ @@ -223,15 +251,14 @@ void spmv_mv_cusparse(const Kokkos::Cuda &exec, \ using coefficient_type = typename YVector::non_const_value_type; \ \ - using Controls = KokkosKernels::Experimental::Controls; \ - static void spmv_mv(const Kokkos::Cuda &exec, const Controls &controls, \ + static void spmv_mv(const Kokkos::Cuda &exec, Handle *handle, \ const char mode[], const coefficient_type &alpha, \ const AMatrix &A, const XVector &x, \ const coefficient_type &beta, const YVector &y) { \ std::string label = "KokkosSparse::spmv[TPL_CUSPARSE," + \ Kokkos::ArithTraits::name() + "]"; \ Kokkos::Profiling::pushRegion(label); \ - spmv_mv_cusparse(exec, controls, mode, alpha, A, x, beta, y); \ + spmv_mv_cusparse(exec, handle, mode, alpha, A, x, beta, y); \ Kokkos::Profiling::popRegion(); \ } \ }; diff --git a/sparse/tpls/KokkosSparse_spmv_tpl_spec_avail.hpp b/sparse/tpls/KokkosSparse_spmv_tpl_spec_avail.hpp index 653ec94811..854c2f2b26 100644 --- a/sparse/tpls/KokkosSparse_spmv_tpl_spec_avail.hpp +++ 
b/sparse/tpls/KokkosSparse_spmv_tpl_spec_avail.hpp @@ -24,7 +24,8 @@ namespace KokkosSparse { namespace Impl { // Specialization struct which defines whether a specialization exists -template +template struct spmv_tpl_spec_avail { enum : bool { value = false }; }; @@ -40,6 +41,8 @@ struct spmv_tpl_spec_avail { template <> \ struct spmv_tpl_spec_avail< \ Kokkos::Cuda, \ + KokkosSparse::Impl::SPMVHandleImpl, \ KokkosSparse::CrsMatrix< \ const SCALAR, const ORDINAL, Kokkos::Device, \ Kokkos::MemoryTraits, const OFFSET>, \ @@ -187,6 +190,9 @@ KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_CUSPARSE(Kokkos::complex, int64_t, template <> \ struct spmv_tpl_spec_avail< \ Kokkos::HIP, \ + KokkosSparse::Impl::SPMVHandleImpl, \ KokkosSparse::CrsMatrix, \ Kokkos::MemoryTraits, \ @@ -217,22 +223,24 @@ KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ROCSPARSE(Kokkos::complex, #endif // KOKKOSKERNELS_ENABLE_TPL_ROCSPARSE #ifdef KOKKOSKERNELS_ENABLE_TPL_MKL -#define KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_MKL(SCALAR, EXECSPACE) \ - template <> \ - struct spmv_tpl_spec_avail< \ - EXECSPACE, \ - KokkosSparse::CrsMatrix, \ - Kokkos::MemoryTraits, \ - const MKL_INT>, \ - Kokkos::View< \ - const SCALAR*, Kokkos::LayoutLeft, \ - Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>> { \ - enum : bool { value = true }; \ +#define KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_MKL(SCALAR, EXECSPACE) \ + template <> \ + struct spmv_tpl_spec_avail< \ + EXECSPACE, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + KokkosSparse::CrsMatrix, \ + Kokkos::MemoryTraits, \ + const MKL_INT>, \ + Kokkos::View< \ + const SCALAR*, Kokkos::LayoutLeft, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>> { \ + enum : bool { value = true }; \ }; #ifdef KOKKOS_ENABLE_SERIAL @@ -251,45 +259,57 @@ KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_MKL(Kokkos::complex, Kokkos::OpenMP) #if defined(KOKKOS_ENABLE_SYCL) && \ !defined(KOKKOSKERNELS_ENABLE_TPL_MKL_SYCL_OVERRIDE) -#define 
KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL(SCALAR, ORDINAL, MEMSPACE) \ - template <> \ - struct spmv_tpl_spec_avail< \ - Kokkos::Experimental::SYCL, \ - KokkosSparse::CrsMatrix< \ - const SCALAR, const ORDINAL, \ - Kokkos::Device, \ - Kokkos::MemoryTraits, const ORDINAL>, \ - Kokkos::View< \ - const SCALAR*, Kokkos::LayoutLeft, \ - Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>> { \ - enum : bool { value = true }; \ +#define KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL(SCALAR, ORDINAL, MEMSPACE) \ + template <> \ + struct spmv_tpl_spec_avail< \ + Kokkos::Experimental::SYCL, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + KokkosSparse::CrsMatrix< \ + const SCALAR, const ORDINAL, \ + Kokkos::Device, \ + Kokkos::MemoryTraits, const ORDINAL>, \ + Kokkos::View< \ + const SCALAR*, Kokkos::LayoutLeft, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>> { \ + enum : bool { value = true }; \ }; +// intel-oneapi-mkl/2023.2.0: spmv with complex data types produce: +// oneapi::mkl::sparse::gemv: unimplemented functionality: currently does not +// support complex data types. +// TODO: Revisit with later versions and selectively enable this if it's +// working. 
+ KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL( float, std::int32_t, Kokkos::Experimental::SYCLDeviceUSMSpace) KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL( double, std::int32_t, Kokkos::Experimental::SYCLDeviceUSMSpace) +/* KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL( Kokkos::complex, std::int32_t, Kokkos::Experimental::SYCLDeviceUSMSpace) KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL( Kokkos::complex, std::int32_t, Kokkos::Experimental::SYCLDeviceUSMSpace) +*/ KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL( float, std::int64_t, Kokkos::Experimental::SYCLDeviceUSMSpace) KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL( double, std::int64_t, Kokkos::Experimental::SYCLDeviceUSMSpace) +/* KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL( Kokkos::complex, std::int64_t, Kokkos::Experimental::SYCLDeviceUSMSpace) KOKKOSSPARSE_SPMV_TPL_SPEC_AVAIL_ONEMKL( Kokkos::complex, std::int64_t, Kokkos::Experimental::SYCLDeviceUSMSpace) +*/ #endif #endif // KOKKOSKERNELS_ENABLE_TPL_MKL diff --git a/sparse/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp b/sparse/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp index a4c50e437f..926d201a52 100644 --- a/sparse/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp +++ b/sparse/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp @@ -18,8 +18,7 @@ #define KOKKOSPARSE_SPMV_TPL_SPEC_DECL_HPP_ #include - -#include "KokkosKernels_Controls.hpp" +#include "KokkosKernels_tpl_handles_decl.hpp" // cuSPARSE #ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE @@ -29,10 +28,8 @@ namespace KokkosSparse { namespace Impl { -template -void spmv_cusparse(const Kokkos::Cuda& exec, - const KokkosKernels::Experimental::Controls& controls, - const char mode[], +template +void spmv_cusparse(const Kokkos::Cuda& exec, Handle* handle, const char mode[], typename YVector::non_const_value_type const& alpha, const AMatrix& A, const XVector& x, typename YVector::non_const_value_type const& beta, @@ -41,9 +38,10 @@ void spmv_cusparse(const Kokkos::Cuda& exec, using value_type = typename AMatrix::non_const_value_type; /* initialize cusparse library */ - 
cusparseHandle_t cusparseHandle = controls.getCusparseHandle(); + cusparseHandle_t cusparseHandle = + KokkosKernels::Impl::CusparseSingleton::singleton().cusparseHandle; /* Set cuSPARSE to use the given stream until this function exits */ - TemporarySetCusparseStream(cusparseHandle, exec); + TemporarySetCusparseStream tscs(cusparseHandle, exec); /* Set the operation mode */ cusparseOperation_t myCusparseOperation; @@ -65,14 +63,11 @@ void spmv_cusparse(const Kokkos::Cuda& exec, !Kokkos::ArithTraits::isComplex) myCusparseOperation = CUSPARSE_OPERATION_TRANSPOSE; +// Hopefully this corresponds to CUDA release 10.1, which is the first to +// include the "generic" API #if defined(CUSPARSE_VERSION) && (10300 <= CUSPARSE_VERSION) using entry_type = typename AMatrix::non_const_ordinal_type; - /* Check that cusparse can handle the types of the input Kokkos::CrsMatrix */ - const cusparseIndexType_t myCusparseOffsetType = - cusparse_index_type_t_from(); - const cusparseIndexType_t myCusparseEntryType = - cusparse_index_type_t_from(); cudaDataType myCudaDataType; if (std::is_same::value) @@ -88,13 +83,11 @@ void spmv_cusparse(const Kokkos::Cuda& exec, "Scalar (data) type of CrsMatrix isn't supported by cuSPARSE, yet TPL " "layer says it is"); - /* create matrix */ - cusparseSpMatDescr_t A_cusparse; - KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateCsr( - &A_cusparse, A.numRows(), A.numCols(), A.nnz(), - (void*)A.graph.row_map.data(), (void*)A.graph.entries.data(), - (void*)A.values.data(), myCusparseOffsetType, myCusparseEntryType, - CUSPARSE_INDEX_BASE_ZERO, myCudaDataType)); + /* Check that cusparse can handle the types of the input Kokkos::CrsMatrix */ + const cusparseIndexType_t myCusparseOffsetType = + cusparse_index_type_t_from(); + const cusparseIndexType_t myCusparseEntryType = + cusparse_index_type_t_from(); /* create lhs and rhs */ cusparseDnVecDescr_t vecX, vecY; @@ -103,150 +96,170 @@ void spmv_cusparse(const Kokkos::Cuda& exec, 
KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateDnVec( &vecY, y.extent_int(0), (void*)y.data(), myCudaDataType)); - size_t bufferSize = 0; - void* dBuffer = NULL; -#if CUSPARSE_VERSION >= 11301 - cusparseSpMVAlg_t alg = CUSPARSE_SPMV_ALG_DEFAULT; -#else - cusparseSpMVAlg_t alg = CUSPARSE_MV_ALG_DEFAULT; -#endif - if (controls.isParameter("algorithm")) { - const std::string algName = controls.getParameter("algorithm"); - if (algName == "default") -#if CUSPARSE_VERSION >= 11301 - alg = CUSPARSE_SPMV_ALG_DEFAULT; + // use default cusparse algo for best performance +#if CUSPARSE_VERSION >= 11400 + cusparseSpMVAlg_t algo = CUSPARSE_SPMV_ALG_DEFAULT; #else - alg = CUSPARSE_MV_ALG_DEFAULT; + cusparseSpMVAlg_t algo = CUSPARSE_MV_ALG_DEFAULT; #endif - else if (algName == "merge") -#if CUSPARSE_VERSION >= 11301 - alg = CUSPARSE_SPMV_CSR_ALG2; + + KokkosSparse::Impl::CuSparse10_SpMV_Data* subhandle; + + if (handle->is_set_up) { + subhandle = + dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for cusparse"); + } else { + subhandle = new KokkosSparse::Impl::CuSparse10_SpMV_Data(exec); + handle->tpl = subhandle; + + /* create matrix */ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateCsr( + &subhandle->mat, A.numRows(), A.numCols(), A.nnz(), + (void*)A.graph.row_map.data(), (void*)A.graph.entries.data(), + (void*)A.values.data(), myCusparseOffsetType, myCusparseEntryType, + CUSPARSE_INDEX_BASE_ZERO, myCudaDataType)); + + /* size and allocate buffer */ + KOKKOS_CUSPARSE_SAFE_CALL(cusparseSpMV_bufferSize( + cusparseHandle, myCusparseOperation, &alpha, subhandle->mat, vecX, + &beta, vecY, myCudaDataType, algo, &subhandle->bufferSize)); + // Async memory management introduced in CUDA 11.2 +#if (CUDA_VERSION >= 11020) + KOKKOS_IMPL_CUDA_SAFE_CALL(cudaMallocAsync( + &subhandle->buffer, subhandle->bufferSize, exec.cuda_stream())); #else - alg = CUSPARSE_CSRMV_ALG2; + KOKKOS_IMPL_CUDA_SAFE_CALL( + cudaMalloc(&subhandle->buffer, 
subhandle->bufferSize)); #endif + handle->is_set_up = true; } - KOKKOS_CUSPARSE_SAFE_CALL(cusparseSpMV_bufferSize( - cusparseHandle, myCusparseOperation, &alpha, A_cusparse, vecX, &beta, - vecY, myCudaDataType, alg, &bufferSize)); - KOKKOS_IMPL_CUDA_SAFE_CALL(cudaMalloc(&dBuffer, bufferSize)); /* perform SpMV */ - KOKKOS_CUSPARSE_SAFE_CALL(cusparseSpMV(cusparseHandle, myCusparseOperation, - &alpha, A_cusparse, vecX, &beta, vecY, - myCudaDataType, alg, dBuffer)); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSpMV(cusparseHandle, myCusparseOperation, &alpha, subhandle->mat, + vecX, &beta, vecY, myCudaDataType, algo, subhandle->buffer)); - KOKKOS_IMPL_CUDA_SAFE_CALL(cudaFree(dBuffer)); KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroyDnVec(vecX)); KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroyDnVec(vecY)); - KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroySpMat(A_cusparse)); #elif (9000 <= CUDA_VERSION) - /* create and set the matrix descriptor */ - cusparseMatDescr_t descrA = 0; - KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&descrA)); - KOKKOS_CUSPARSE_SAFE_CALL( - cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)); - KOKKOS_CUSPARSE_SAFE_CALL( - cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO)); + KokkosSparse::Impl::CuSparse9_SpMV_Data* subhandle; - /* perform the actual SpMV operation */ - if (std::is_same::value) { - if (std::is_same::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseScsrmv( - cusparseHandle, myCusparseOperation, A.numRows(), A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), - reinterpret_cast(x.data()), - reinterpret_cast(&beta), - reinterpret_cast(y.data()))); - - } else if (std::is_same::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseDcsrmv( - cusparseHandle, myCusparseOperation, A.numRows(), A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), - 
reinterpret_cast(x.data()), - reinterpret_cast(&beta), - reinterpret_cast(y.data()))); - } else if (std::is_same>::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseCcsrmv( - cusparseHandle, myCusparseOperation, A.numRows(), A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), - reinterpret_cast(x.data()), - reinterpret_cast(&beta), - reinterpret_cast(y.data()))); - } else if (std::is_same>::value) { - KOKKOS_CUSPARSE_SAFE_CALL(cusparseZcsrmv( - cusparseHandle, myCusparseOperation, A.numRows(), A.numCols(), - A.nnz(), reinterpret_cast(&alpha), descrA, - reinterpret_cast(A.values.data()), - A.graph.row_map.data(), A.graph.entries.data(), - reinterpret_cast(x.data()), - reinterpret_cast(&beta), - reinterpret_cast(y.data()))); - } else { - throw std::logic_error( - "Trying to call cusparse SpMV with a scalar type not float/double, " - "nor complex of either!"); - } + if (handle->is_set_up) { + subhandle = + dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for cusparse"); } else { - throw std::logic_error( - "With cuSPARSE pre-10.0, offset type must be int. Something wrong with " - "TPL avail logic."); + /* create and set the subhandle and matrix descriptor */ + subhandle = new KokkosSparse::Impl::CuSparse9_SpMV_Data(exec); + handle->tpl = subhandle; + cusparseMatDescr_t descrA = 0; + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCreateMatDescr(&subhandle->mat)); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetMatType(subhandle->mat, CUSPARSE_MATRIX_TYPE_GENERAL)); + KOKKOS_CUSPARSE_SAFE_CALL( + cusparseSetMatIndexBase(subhandle->mat, CUSPARSE_INDEX_BASE_ZERO)); + handle->is_set_up = true; } - KOKKOS_CUSPARSE_SAFE_CALL(cusparseDestroyMatDescr(descrA)); + /* perform the actual SpMV operation */ + static_assert( + std::is_same_v, + "With cuSPARSE pre-10.0, offset type must be int. 
Something wrong with " + "TPL avail logic."); + if constexpr (std::is_same_v) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseScsrmv( + cusparseHandle, myCusparseOperation, A.numRows(), A.numCols(), A.nnz(), + reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), A.graph.row_map.data(), + A.graph.entries.data(), reinterpret_cast(x.data()), + reinterpret_cast(&beta), + reinterpret_cast(y.data()))); + + } else if constexpr (std::is_same_v) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseDcsrmv( + cusparseHandle, myCusparseOperation, A.numRows(), A.numCols(), A.nnz(), + reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), + reinterpret_cast(x.data()), + reinterpret_cast(&beta), + reinterpret_cast(y.data()))); + } else if constexpr (std::is_same_v>) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseCcsrmv( + cusparseHandle, myCusparseOperation, A.numRows(), A.numCols(), A.nnz(), + reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), + reinterpret_cast(x.data()), + reinterpret_cast(&beta), + reinterpret_cast(y.data()))); + } else if constexpr (std::is_same_v>) { + KOKKOS_CUSPARSE_SAFE_CALL(cusparseZcsrmv( + cusparseHandle, myCusparseOperation, A.numRows(), A.numCols(), A.nnz(), + reinterpret_cast(&alpha), subhandle->mat, + reinterpret_cast(A.values.data()), + A.graph.row_map.data(), A.graph.entries.data(), + reinterpret_cast(x.data()), + reinterpret_cast(&beta), + reinterpret_cast(y.data()))); + } else { + static_assert( + KokkosKernels::Impl::always_false_v, + "Trying to call cusparse SpMV with a scalar type not float/double, " + "nor complex of either!"); + } #endif // CUDA_VERSION } -#define KOKKOSSPARSE_SPMV_CUSPARSE(SCALAR, ORDINAL, OFFSET, LAYOUT, SPACE, \ - COMPILE_LIBRARY) \ - template <> \ - struct SPMV< \ - Kokkos::Cuda, \ - KokkosSparse::CrsMatrix< \ - SCALAR const, ORDINAL const, Kokkos::Device, \ - 
Kokkos::MemoryTraits, OFFSET const>, \ - Kokkos::View< \ - SCALAR const*, LAYOUT, Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>, \ - true, COMPILE_LIBRARY> { \ - using device_type = Kokkos::Device; \ - using memory_trait_type = Kokkos::MemoryTraits; \ - using AMatrix = CrsMatrix; \ - using XVector = Kokkos::View< \ - SCALAR const*, LAYOUT, device_type, \ - Kokkos::MemoryTraits>; \ - using YVector = \ - Kokkos::View; \ - using Controls = KokkosKernels::Experimental::Controls; \ - \ - using coefficient_type = typename YVector::non_const_value_type; \ - \ - static void spmv(const Kokkos::Cuda& exec, const Controls& controls, \ - const char mode[], const coefficient_type& alpha, \ - const AMatrix& A, const XVector& x, \ - const coefficient_type& beta, const YVector& y) { \ - std::string label = "KokkosSparse::spmv[TPL_CUSPARSE," + \ - Kokkos::ArithTraits::name() + "]"; \ - Kokkos::Profiling::pushRegion(label); \ - spmv_cusparse(exec, controls, mode, alpha, A, x, beta, y); \ - Kokkos::Profiling::popRegion(); \ - } \ +#define KOKKOSSPARSE_SPMV_CUSPARSE(SCALAR, ORDINAL, OFFSET, LAYOUT, SPACE, \ + COMPILE_LIBRARY) \ + template <> \ + struct SPMV< \ + Kokkos::Cuda, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + KokkosSparse::CrsMatrix< \ + SCALAR const, ORDINAL const, Kokkos::Device, \ + Kokkos::MemoryTraits, OFFSET const>, \ + Kokkos::View< \ + SCALAR const*, LAYOUT, Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, COMPILE_LIBRARY> { \ + using device_type = Kokkos::Device; \ + using memory_trait_type = Kokkos::MemoryTraits; \ + using Handle = \ + KokkosSparse::Impl::SPMVHandleImpl; \ + using AMatrix = CrsMatrix; \ + using XVector = Kokkos::View< \ + SCALAR const*, LAYOUT, device_type, \ + Kokkos::MemoryTraits>; \ + using YVector = \ + Kokkos::View; \ + using coefficient_type = typename YVector::non_const_value_type; \ + \ + static void spmv(const Kokkos::Cuda& exec, Handle* handle, \ + 
const char mode[], const coefficient_type& alpha, \ + const AMatrix& A, const XVector& x, \ + const coefficient_type& beta, const YVector& y) { \ + std::string label = "KokkosSparse::spmv[TPL_CUSPARSE," + \ + Kokkos::ArithTraits::name() + "]"; \ + Kokkos::Profiling::pushRegion(label); \ + spmv_cusparse(exec, handle, mode, alpha, A, x, beta, y); \ + Kokkos::Profiling::popRegion(); \ + } \ }; -// BMK: cuSPARSE that comes with CUDA 9 does not support tranpose or conjugate -// transpose modes. No version of cuSPARSE supports mode C (conjugate, non -// transpose). In those cases, fall back to KokkosKernels native spmv. - #if (9000 <= CUDA_VERSION) KOKKOSSPARSE_SPMV_CUSPARSE(double, int, int, Kokkos::LayoutLeft, Kokkos::CudaSpace, @@ -362,10 +375,8 @@ KOKKOSSPARSE_SPMV_CUSPARSE(Kokkos::complex, int64_t, size_t, namespace KokkosSparse { namespace Impl { -template -void spmv_rocsparse(const Kokkos::HIP& exec, - const KokkosKernels::Experimental::Controls& controls, - const char mode[], +template +void spmv_rocsparse(const Kokkos::HIP& exec, Handle* handle, const char mode[], typename YVector::non_const_value_type const& alpha, const AMatrix& A, const XVector& x, typename YVector::non_const_value_type const& beta, @@ -375,9 +386,10 @@ void spmv_rocsparse(const Kokkos::HIP& exec, using value_type = typename AMatrix::non_const_value_type; /* initialize rocsparse library */ - rocsparse_handle handle = controls.getRocsparseHandle(); + rocsparse_handle rocsparseHandle = + KokkosKernels::Impl::RocsparseSingleton::singleton().rocsparseHandle; /* Set rocsparse to use the given stream until this function exits */ - TemporarySetRocsparseStream(handle, exec); + TemporarySetRocsparseStream tsrs(rocsparseHandle, exec); /* Set the operation mode */ rocsparse_operation myRocsparseOperation = mode_kk_to_rocsparse(mode); @@ -389,24 +401,6 @@ void spmv_rocsparse(const Kokkos::HIP& exec, /* Set the scalar type */ rocsparse_datatype compute_type = rocsparse_compute_type(); - /* Create the 
rocsparse mat and csr descr */ - rocsparse_mat_descr Amat; - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_create_mat_descr(&Amat)); - rocsparse_spmat_descr Aspmat; - // We need to do some casting to void* - // Note that row_map is always a const view so const_cast is necessary, - // however entries and values may not be const so we need to check first. - void* csr_row_ptr = - static_cast(const_cast(A.graph.row_map.data())); - void* csr_col_ind = - static_cast(const_cast(A.graph.entries.data())); - void* csr_val = static_cast(const_cast(A.values.data())); - - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_create_csr_descr( - &Aspmat, A.numRows(), A.numCols(), A.nnz(), csr_row_ptr, csr_col_ind, - csr_val, offset_index_type, entry_index_type, rocsparse_index_base_zero, - compute_type)); - /* Create rocsparse dense vectors for X and Y */ rocsparse_dnvec_descr vecX, vecY; void* x_data = static_cast( @@ -420,99 +414,134 @@ void spmv_rocsparse(const Kokkos::HIP& exec, &vecY, y.extent_int(0), y_data, rocsparse_compute_type())); - /* Actually perform the SpMV operation, first size buffer, then compute result - */ - size_t buffer_size = 0; - void* tmp_buffer = nullptr; rocsparse_spmv_alg alg = rocsparse_spmv_alg_default; - // Note, Dec 6th 2021 - lbv: - // rocSPARSE offers two diffrent algorithms for spmv - // 1. ocsparse_spmv_alg_csr_adaptive - // 2. rocsparse_spmv_alg_csr_stream - // it is unclear which one is the default algorithm - // or what both algorithms are intended for? 
- if (controls.isParameter("algorithm")) { - const std::string algName = controls.getParameter("algorithm"); - if (algName == "default") - alg = rocsparse_spmv_alg_default; - else if (algName == "merge") - alg = rocsparse_spmv_alg_csr_stream; + + KokkosSparse::Impl::RocSparse_CRS_SpMV_Data* subhandle; + if (handle->is_set_up) { + subhandle = + dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for rocsparse CRS"); + } else { + subhandle = new KokkosSparse::Impl::RocSparse_CRS_SpMV_Data(exec); + handle->tpl = subhandle; + /* Create the rocsparse csr descr */ + // We need to do some casting to void* + // Note that row_map is always a const view so const_cast is necessary, + // however entries and values may not be const so we need to check first. + void* csr_row_ptr = + static_cast(const_cast(A.graph.row_map.data())); + void* csr_col_ind = + static_cast(const_cast(A.graph.entries.data())); + void* csr_val = + static_cast(const_cast(A.values.data())); + + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_create_csr_descr( + &subhandle->mat, A.numRows(), A.numCols(), A.nnz(), csr_row_ptr, + csr_col_ind, csr_val, offset_index_type, entry_index_type, + rocsparse_index_base_zero, compute_type)); + + /* Size and allocate buffer, and analyze the matrix */ + +#if KOKKOSSPARSE_IMPL_ROCM_VERSION >= 60000 + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_spmv( + rocsparseHandle, myRocsparseOperation, &alpha, subhandle->mat, vecX, + &beta, vecY, compute_type, alg, rocsparse_spmv_stage_buffer_size, + &subhandle->bufferSize, nullptr)); + KOKKOS_IMPL_HIP_SAFE_CALL( + hipMalloc(&subhandle->buffer, subhandle->bufferSize)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_spmv( + rocsparseHandle, myRocsparseOperation, &alpha, subhandle->mat, vecX, + &beta, vecY, compute_type, alg, rocsparse_spmv_stage_preprocess, + &subhandle->bufferSize, subhandle->buffer)); +#elif KOKKOSSPARSE_IMPL_ROCM_VERSION >= 50400 + 
KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_spmv_ex( + rocsparseHandle, myRocsparseOperation, &alpha, subhandle->mat, vecX, + &beta, vecY, compute_type, alg, rocsparse_spmv_stage_auto, + &subhandle->bufferSize, nullptr)); + KOKKOS_IMPL_HIP_SAFE_CALL( + hipMalloc(&subhandle->buffer, subhandle->bufferSize)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_spmv_ex( + rocsparseHandle, myRocsparseOperation, &alpha, subhandle->mat, vecX, + &beta, vecY, compute_type, alg, rocsparse_spmv_stage_preprocess, + &subhandle->bufferSize, subhandle->buffer)); +#else + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_spmv( + rocsparseHandle, myRocsparseOperation, &alpha, subhandle->mat, vecX, + &beta, vecY, compute_type, alg, &subhandle->bufferSize, nullptr)); + KOKKOS_IMPL_HIP_SAFE_CALL( + hipMalloc(&subhandle->buffer, subhandle->bufferSize)); +#endif + handle->is_set_up = true; } + /* Perform the actual computation */ + #if KOKKOSSPARSE_IMPL_ROCM_VERSION >= 60000 - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( - rocsparse_spmv(handle, myRocsparseOperation, &alpha, Aspmat, vecX, &beta, - vecY, compute_type, alg, rocsparse_spmv_stage_buffer_size, - &buffer_size, tmp_buffer)); - KOKKOS_IMPL_HIP_SAFE_CALL(hipMalloc(&tmp_buffer, buffer_size)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( - rocsparse_spmv(handle, myRocsparseOperation, &alpha, Aspmat, vecX, &beta, - vecY, compute_type, alg, rocsparse_spmv_stage_compute, - &buffer_size, tmp_buffer)); + KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_spmv( + rocsparseHandle, myRocsparseOperation, &alpha, subhandle->mat, vecX, + &beta, vecY, compute_type, alg, rocsparse_spmv_stage_compute, + &subhandle->bufferSize, subhandle->buffer)); #elif KOKKOSSPARSE_IMPL_ROCM_VERSION >= 50400 KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_spmv_ex( - handle, myRocsparseOperation, &alpha, Aspmat, vecX, &beta, vecY, - compute_type, alg, rocsparse_spmv_stage_auto, &buffer_size, tmp_buffer)); - KOKKOS_IMPL_HIP_SAFE_CALL(hipMalloc(&tmp_buffer, buffer_size)); - 
KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_spmv_ex( - handle, myRocsparseOperation, &alpha, Aspmat, vecX, &beta, vecY, - compute_type, alg, rocsparse_spmv_stage_auto, &buffer_size, tmp_buffer)); + rocsparseHandle, myRocsparseOperation, &alpha, subhandle->mat, vecX, + &beta, vecY, compute_type, alg, rocsparse_spmv_stage_compute, + &subhandle->bufferSize, subhandle->buffer)); #else KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( - rocsparse_spmv(handle, myRocsparseOperation, &alpha, Aspmat, vecX, &beta, - vecY, compute_type, alg, &buffer_size, tmp_buffer)); - KOKKOS_IMPL_HIP_SAFE_CALL(hipMalloc(&tmp_buffer, buffer_size)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL( - rocsparse_spmv(handle, myRocsparseOperation, &alpha, Aspmat, vecX, &beta, - vecY, compute_type, alg, &buffer_size, tmp_buffer)); + rocsparse_spmv(rocsparseHandle, myRocsparseOperation, &alpha, + subhandle->mat, vecX, &beta, vecY, compute_type, alg, + &subhandle->bufferSize, subhandle->buffer)); #endif - KOKKOS_IMPL_HIP_SAFE_CALL(hipFree(tmp_buffer)); KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_destroy_dnvec_descr(vecY)); KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_destroy_dnvec_descr(vecX)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_destroy_spmat_descr(Aspmat)); - KOKKOS_ROCSPARSE_SAFE_CALL_IMPL(rocsparse_destroy_mat_descr(Amat)); } -#define KOKKOSSPARSE_SPMV_ROCSPARSE(SCALAR, LAYOUT, COMPILE_LIBRARY) \ - template <> \ - struct SPMV< \ - Kokkos::HIP, \ - KokkosSparse::CrsMatrix, \ - Kokkos::MemoryTraits, \ - rocsparse_int const>, \ - Kokkos::View< \ - SCALAR const*, LAYOUT, \ - Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>, \ - true, COMPILE_LIBRARY> { \ - using device_type = Kokkos::Device; \ - using memory_trait_type = Kokkos::MemoryTraits; \ - using AMatrix = CrsMatrix; \ - using XVector = Kokkos::View< \ - SCALAR const*, LAYOUT, device_type, \ - Kokkos::MemoryTraits>; \ - using YVector = \ - Kokkos::View; \ - using Controls = KokkosKernels::Experimental::Controls; \ - \ - using 
coefficient_type = typename YVector::non_const_value_type; \ - \ - static void spmv(const Kokkos::HIP& exec, const Controls& controls, \ - const char mode[], const coefficient_type& alpha, \ - const AMatrix& A, const XVector& x, \ - const coefficient_type& beta, const YVector& y) { \ - std::string label = "KokkosSparse::spmv[TPL_ROCSPARSE," + \ - Kokkos::ArithTraits::name() + "]"; \ - Kokkos::Profiling::pushRegion(label); \ - spmv_rocsparse(exec, controls, mode, alpha, A, x, beta, y); \ - Kokkos::Profiling::popRegion(); \ - } \ +#define KOKKOSSPARSE_SPMV_ROCSPARSE(SCALAR, LAYOUT, COMPILE_LIBRARY) \ + template <> \ + struct SPMV< \ + Kokkos::HIP, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + KokkosSparse::CrsMatrix, \ + Kokkos::MemoryTraits, \ + rocsparse_int const>, \ + Kokkos::View< \ + SCALAR const*, LAYOUT, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, COMPILE_LIBRARY> { \ + using device_type = Kokkos::Device; \ + using memory_trait_type = Kokkos::MemoryTraits; \ + using Handle = KokkosSparse::Impl::SPMVHandleImpl< \ + Kokkos::HIP, Kokkos::HIPSpace, SCALAR, rocsparse_int, rocsparse_int>; \ + using AMatrix = CrsMatrix; \ + using XVector = Kokkos::View< \ + SCALAR const*, LAYOUT, device_type, \ + Kokkos::MemoryTraits>; \ + using YVector = \ + Kokkos::View; \ + \ + using coefficient_type = typename YVector::non_const_value_type; \ + \ + static void spmv(const Kokkos::HIP& exec, Handle* handle, \ + const char mode[], const coefficient_type& alpha, \ + const AMatrix& A, const XVector& x, \ + const coefficient_type& beta, const YVector& y) { \ + std::string label = "KokkosSparse::spmv[TPL_ROCSPARSE," + \ + Kokkos::ArithTraits::name() + "]"; \ + Kokkos::Profiling::pushRegion(label); \ + spmv_rocsparse(exec, handle, mode, alpha, A, x, beta, y); \ + Kokkos::Profiling::popRegion(); \ + } \ }; KOKKOSSPARSE_SPMV_ROCSPARSE(double, Kokkos::LayoutLeft, @@ -548,82 +577,77 @@ namespace Impl { #if (__INTEL_MKL__ > 2017) // MKL 
2018 and above: use new interface: sparse_matrix_t and mkl_sparse_?_mv() -inline void spmv_mkl(sparse_operation_t op, float alpha, float beta, MKL_INT m, - MKL_INT n, const MKL_INT* Arowptrs, - const MKL_INT* Aentries, const float* Avalues, - const float* x, float* y) { - sparse_matrix_t A_mkl; - matrix_descr A_descr; - A_descr.type = SPARSE_MATRIX_TYPE_GENERAL; - A_descr.mode = SPARSE_FILL_MODE_FULL; - A_descr.diag = SPARSE_DIAG_NON_UNIT; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_create_csr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, m, n, const_cast(Arowptrs), - const_cast(Arowptrs + 1), const_cast(Aentries), - const_cast(Avalues))); - KOKKOSKERNELS_MKL_SAFE_CALL( - mkl_sparse_s_mv(op, alpha, A_mkl, A_descr, x, beta, y)); -} - -inline void spmv_mkl(sparse_operation_t op, double alpha, double beta, - MKL_INT m, MKL_INT n, const MKL_INT* Arowptrs, - const MKL_INT* Aentries, const double* Avalues, - const double* x, double* y) { - sparse_matrix_t A_mkl; - matrix_descr A_descr; - A_descr.type = SPARSE_MATRIX_TYPE_GENERAL; - A_descr.mode = SPARSE_FILL_MODE_FULL; - A_descr.diag = SPARSE_DIAG_NON_UNIT; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_create_csr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, m, n, const_cast(Arowptrs), - const_cast(Arowptrs + 1), const_cast(Aentries), - const_cast(Avalues))); - KOKKOSKERNELS_MKL_SAFE_CALL( - mkl_sparse_d_mv(op, alpha, A_mkl, A_descr, x, beta, y)); -} - -inline void spmv_mkl(sparse_operation_t op, Kokkos::complex alpha, - Kokkos::complex beta, MKL_INT m, MKL_INT n, - const MKL_INT* Arowptrs, const MKL_INT* Aentries, - const Kokkos::complex* Avalues, - const Kokkos::complex* x, - Kokkos::complex* y) { - sparse_matrix_t A_mkl; - matrix_descr A_descr; - A_descr.type = SPARSE_MATRIX_TYPE_GENERAL; - A_descr.mode = SPARSE_FILL_MODE_FULL; - A_descr.diag = SPARSE_DIAG_NON_UNIT; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_create_csr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, m, n, const_cast(Arowptrs), - const_cast(Arowptrs + 1), const_cast(Aentries), - 
(MKL_Complex8*)Avalues)); - MKL_Complex8 alpha_mkl{alpha.real(), alpha.imag()}; - MKL_Complex8 beta_mkl{beta.real(), beta.imag()}; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_mv( - op, alpha_mkl, A_mkl, A_descr, reinterpret_cast(x), - beta_mkl, reinterpret_cast(y))); -} - -inline void spmv_mkl(sparse_operation_t op, Kokkos::complex alpha, - Kokkos::complex beta, MKL_INT m, MKL_INT n, - const MKL_INT* Arowptrs, const MKL_INT* Aentries, - const Kokkos::complex* Avalues, - const Kokkos::complex* x, - Kokkos::complex* y) { - sparse_matrix_t A_mkl; - matrix_descr A_descr; - A_descr.type = SPARSE_MATRIX_TYPE_GENERAL; - A_descr.mode = SPARSE_FILL_MODE_FULL; - A_descr.diag = SPARSE_DIAG_NON_UNIT; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_create_csr( - &A_mkl, SPARSE_INDEX_BASE_ZERO, m, n, const_cast(Arowptrs), - const_cast(Arowptrs + 1), const_cast(Aentries), - (MKL_Complex16*)Avalues)); - MKL_Complex16 alpha_mkl{alpha.real(), alpha.imag()}; - MKL_Complex16 beta_mkl{beta.real(), beta.imag()}; - KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_mv( - op, alpha_mkl, A_mkl, A_descr, reinterpret_cast(x), - beta_mkl, reinterpret_cast(y))); +// Note: Scalar here is the Kokkos type, not the MKL type +template +inline void spmv_mkl(Handle* handle, sparse_operation_t op, Scalar alpha, + Scalar beta, MKL_INT m, MKL_INT n, const MKL_INT* Arowptrs, + const MKL_INT* Aentries, const Scalar* Avalues, + const Scalar* x, Scalar* y) { + using MKLScalar = typename KokkosToMKLScalar::type; + using ExecSpace = typename Handle::ExecutionSpaceType; + using Subhandle = MKL_SpMV_Data; + Subhandle* subhandle; + const MKLScalar* x_mkl = reinterpret_cast(x); + MKLScalar* y_mkl = reinterpret_cast(y); + if (handle->is_set_up) { + subhandle = dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for MKL CRS"); + } else { + // Use the default execution space instance, as classic MKL does not use + // a specific instance. 
+ subhandle = new Subhandle(ExecSpace()); + handle->tpl = subhandle; + subhandle->descr.type = SPARSE_MATRIX_TYPE_GENERAL; + subhandle->descr.mode = SPARSE_FILL_MODE_FULL; + subhandle->descr.diag = SPARSE_DIAG_NON_UNIT; + // Note: the create_csr routine requires non-const values even though + // they're not actually modified + MKLScalar* Avalues_mkl = + reinterpret_cast(const_cast(Avalues)); + if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_create_csr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, m, n, + const_cast(Arowptrs), const_cast(Arowptrs + 1), + const_cast(Aentries), Avalues_mkl)); + } else if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_create_csr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, m, n, + const_cast(Arowptrs), const_cast(Arowptrs + 1), + const_cast(Aentries), Avalues_mkl)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_create_csr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, m, n, + const_cast(Arowptrs), const_cast(Arowptrs + 1), + const_cast(Aentries), Avalues_mkl)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_create_csr( + &subhandle->mat, SPARSE_INDEX_BASE_ZERO, m, n, + const_cast(Arowptrs), const_cast(Arowptrs + 1), + const_cast(Aentries), Avalues_mkl)); + } + handle->is_set_up = true; + } + MKLScalar alpha_mkl = KokkosToMKLScalar(alpha); + MKLScalar beta_mkl = KokkosToMKLScalar(beta); + if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_s_mv(op, alpha_mkl, subhandle->mat, + subhandle->descr, x_mkl, + beta_mkl, y_mkl)); + } else if constexpr (std::is_same_v) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_d_mv(op, alpha_mkl, subhandle->mat, + subhandle->descr, x_mkl, + beta_mkl, y_mkl)); + } else if constexpr (std::is_same_v>) { + KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_c_mv(op, alpha_mkl, subhandle->mat, + subhandle->descr, x_mkl, + beta_mkl, y_mkl)); + } else if constexpr (std::is_same_v>) { + 
KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_z_mv(op, alpha_mkl, subhandle->mat, + subhandle->descr, x_mkl, + beta_mkl, y_mkl)); + } } // Note: classic MKL runs on Serial/OpenMP but can't use our execution space @@ -631,6 +655,8 @@ inline void spmv_mkl(sparse_operation_t op, Kokkos::complex alpha, #define KOKKOSSPARSE_SPMV_MKL(SCALAR, EXECSPACE, COMPILE_LIBRARY) \ template <> \ struct SPMV, \ KokkosSparse::CrsMatrix< \ SCALAR const, MKL_INT const, \ Kokkos::Device, \ @@ -644,6 +670,9 @@ inline void spmv_mkl(sparse_operation_t op, Kokkos::complex alpha, Kokkos::MemoryTraits>, \ true, COMPILE_LIBRARY> { \ using device_type = Kokkos::Device; \ + using Handle = \ + KokkosSparse::Impl::SPMVHandleImpl; \ using AMatrix = \ CrsMatrix, MKL_INT const>; \ @@ -653,17 +682,16 @@ inline void spmv_mkl(sparse_operation_t op, Kokkos::complex alpha, using YVector = Kokkos::View>; \ using coefficient_type = typename YVector::non_const_value_type; \ - using Controls = KokkosKernels::Experimental::Controls; \ \ - static void spmv(const EXECSPACE&, const Controls&, const char mode[], \ + static void spmv(const EXECSPACE&, Handle* handle, const char mode[], \ const coefficient_type& alpha, const AMatrix& A, \ const XVector& x, const coefficient_type& beta, \ const YVector& y) { \ std::string label = "KokkosSparse::spmv[TPL_MKL," + \ Kokkos::ArithTraits::name() + "]"; \ Kokkos::Profiling::pushRegion(label); \ - spmv_mkl(mode_kk_to_mkl(mode[0]), alpha, beta, A.numRows(), A.numCols(), \ - A.graph.row_map.data(), A.graph.entries.data(), \ + spmv_mkl(handle, mode_kk_to_mkl(mode[0]), alpha, beta, A.numRows(), \ + A.numCols(), A.graph.row_map.data(), A.graph.entries.data(), \ A.values.data(), x.data(), y.data()); \ Kokkos::Profiling::popRegion(); \ } \ @@ -705,122 +733,103 @@ inline oneapi::mkl::transpose mode_kk_to_onemkl(char mode_kk) { "Invalid mode for oneMKL (should be one of N, T, H)"); } -template -struct spmv_onemkl_wrapper {}; - -template <> -struct spmv_onemkl_wrapper { - template - static 
void spmv(const execution_space& exec, oneapi::mkl::transpose mkl_mode, - typename matrix_type::non_const_value_type const alpha, - const matrix_type& A, const xview_type& x, - typename matrix_type::non_const_value_type const beta, - const yview_type& y) { - using scalar_type = typename matrix_type::non_const_value_type; - using ordinal_type = typename matrix_type::non_const_ordinal_type; - - // oneAPI doesn't directly support mode H with real values, but this is - // equivalent to mode T - if (mkl_mode == oneapi::mkl::transpose::conjtrans && - !Kokkos::ArithTraits::isComplex) - mkl_mode = oneapi::mkl::transpose::trans; - - oneapi::mkl::sparse::matrix_handle_t handle = nullptr; - oneapi::mkl::sparse::init_matrix_handle(&handle); - auto ev_set = oneapi::mkl::sparse::set_csr_data( - exec.sycl_queue(), handle, A.numRows(), A.numCols(), +template +inline void spmv_onemkl(const execution_space& exec, Handle* handle, + oneapi::mkl::transpose mkl_mode, + typename matrix_type::non_const_value_type const alpha, + const matrix_type& A, const xview_type& x, + typename matrix_type::non_const_value_type const beta, + const yview_type& y) { + using scalar_type = typename matrix_type::non_const_value_type; + using onemkl_scalar_type = typename KokkosToOneMKLScalar::type; + using ordinal_type = typename matrix_type::non_const_ordinal_type; + + // oneAPI doesn't directly support mode H with real values, but this is + // equivalent to mode T + if (mkl_mode == oneapi::mkl::transpose::conjtrans && + !Kokkos::ArithTraits::isComplex) + mkl_mode = oneapi::mkl::transpose::trans; + + OneMKL_SpMV_Data* subhandle; + if (handle->is_set_up) { + subhandle = dynamic_cast(handle->tpl); + if (!subhandle) + throw std::runtime_error( + "KokkosSparse::spmv: subhandle is not set up for OneMKL CRS"); + } else { + subhandle = new OneMKL_SpMV_Data(exec); + handle->tpl = subhandle; + oneapi::mkl::sparse::init_matrix_handle(&subhandle->mat); + // Even for out-of-order SYCL queue, the inputs here do not 
depend on + // kernels being sequenced + auto ev = oneapi::mkl::sparse::set_csr_data( + exec.sycl_queue(), subhandle->mat, A.numRows(), A.numCols(), oneapi::mkl::index_base::zero, const_cast(A.graph.row_map.data()), const_cast(A.graph.entries.data()), - const_cast(A.values.data())); - auto ev_opt = oneapi::mkl::sparse::optimize_gemv( - exec.sycl_queue(), mkl_mode, handle, {ev_set}); - auto ev_gemv = - oneapi::mkl::sparse::gemv(exec.sycl_queue(), mkl_mode, alpha, handle, - x.data(), beta, y.data(), {ev_opt}); - auto ev_release = oneapi::mkl::sparse::release_matrix_handle( - exec.sycl_queue(), &handle, {ev_gemv}); - ev_release.wait(); - } -}; - -template <> -struct spmv_onemkl_wrapper { - template - static void spmv(const execution_space& exec, oneapi::mkl::transpose mkl_mode, - typename matrix_type::non_const_value_type const alpha, - const matrix_type& A, const xview_type& x, - typename matrix_type::non_const_value_type const beta, - const yview_type& y) { - using scalar_type = typename matrix_type::non_const_value_type; - using ordinal_type = typename matrix_type::non_const_ordinal_type; - using mag_type = typename Kokkos::ArithTraits::mag_type; - - oneapi::mkl::sparse::matrix_handle_t handle = nullptr; - oneapi::mkl::sparse::init_matrix_handle(&handle); - auto ev_set = oneapi::mkl::sparse::set_csr_data( - exec.sycl_queue(), handle, static_cast(A.numRows()), - static_cast(A.numCols()), oneapi::mkl::index_base::zero, - const_cast(A.graph.row_map.data()), - const_cast(A.graph.entries.data()), - reinterpret_cast*>( + reinterpret_cast( const_cast(A.values.data()))); - auto ev_opt = oneapi::mkl::sparse::optimize_gemv( - exec.sycl_queue(), mkl_mode, handle, {ev_set}); - auto ev_gemv = oneapi::mkl::sparse::gemv( - exec.sycl_queue(), mkl_mode, alpha, handle, - reinterpret_cast*>( - const_cast(x.data())), - beta, reinterpret_cast*>(y.data()), {ev_opt}); - auto ev_release = oneapi::mkl::sparse::release_matrix_handle( - exec.sycl_queue(), &handle, {ev_gemv}); - 
ev_release.wait(); + // for out-of-order queue: the fence before gemv below will make sure + // optimize_gemv has finished + oneapi::mkl::sparse::optimize_gemv(exec.sycl_queue(), mkl_mode, + subhandle->mat, {ev}); + handle->is_set_up = true; } -}; - -#define KOKKOSSPARSE_SPMV_ONEMKL(SCALAR, ORDINAL, MEMSPACE, COMPILE_LIBRARY) \ - template <> \ - struct SPMV< \ - Kokkos::Experimental::SYCL, \ - KokkosSparse::CrsMatrix< \ - SCALAR const, ORDINAL const, \ - Kokkos::Device, \ - Kokkos::MemoryTraits, ORDINAL const>, \ - Kokkos::View< \ - SCALAR const*, Kokkos::LayoutLeft, \ - Kokkos::Device, \ - Kokkos::MemoryTraits>, \ - Kokkos::View, \ - Kokkos::MemoryTraits>, \ - true, COMPILE_LIBRARY> { \ - using execution_space = Kokkos::Experimental::SYCL; \ - using device_type = Kokkos::Device; \ - using AMatrix = \ - CrsMatrix, ORDINAL const>; \ - using XVector = Kokkos::View< \ - SCALAR const*, Kokkos::LayoutLeft, device_type, \ - Kokkos::MemoryTraits>; \ - using YVector = Kokkos::View>; \ - using coefficient_type = typename YVector::non_const_value_type; \ - using Controls = KokkosKernels::Experimental::Controls; \ - \ - static void spmv(const execution_space& exec, const Controls&, \ - const char mode[], const coefficient_type& alpha, \ - const AMatrix& A, const XVector& x, \ - const coefficient_type& beta, const YVector& y) { \ - std::string label = "KokkosSparse::spmv[TPL_ONEMKL," + \ - Kokkos::ArithTraits::name() + "]"; \ - Kokkos::Profiling::pushRegion(label); \ - oneapi::mkl::transpose mkl_mode = mode_kk_to_onemkl(mode[0]); \ - spmv_onemkl_wrapper::is_complex>::spmv( \ - exec, mkl_mode, alpha, A, x, beta, y); \ - Kokkos::Profiling::popRegion(); \ - } \ + + // Uncommon case: an out-of-order SYCL queue does not promise that previously + // enqueued kernels finish before starting this one. So fence exec to get the + // expected semantics. 
+ if (!exec.sycl_queue().is_in_order()) exec.fence(); + oneapi::mkl::sparse::gemv( + exec.sycl_queue(), mkl_mode, alpha, subhandle->mat, + reinterpret_cast(x.data()), beta, + reinterpret_cast(y.data())); +} + +#define KOKKOSSPARSE_SPMV_ONEMKL(SCALAR, ORDINAL, MEMSPACE, COMPILE_LIBRARY) \ + template <> \ + struct SPMV< \ + Kokkos::Experimental::SYCL, \ + KokkosSparse::Impl::SPMVHandleImpl, \ + KokkosSparse::CrsMatrix< \ + SCALAR const, ORDINAL const, \ + Kokkos::Device, \ + Kokkos::MemoryTraits, ORDINAL const>, \ + Kokkos::View< \ + SCALAR const*, Kokkos::LayoutLeft, \ + Kokkos::Device, \ + Kokkos::MemoryTraits>, \ + Kokkos::View, \ + Kokkos::MemoryTraits>, \ + true, COMPILE_LIBRARY> { \ + using execution_space = Kokkos::Experimental::SYCL; \ + using device_type = Kokkos::Device; \ + using Handle = KokkosSparse::Impl::SPMVHandleImpl< \ + Kokkos::Experimental::SYCL, MEMSPACE, SCALAR, ORDINAL, ORDINAL>; \ + using AMatrix = \ + CrsMatrix, ORDINAL const>; \ + using XVector = Kokkos::View< \ + SCALAR const*, Kokkos::LayoutLeft, device_type, \ + Kokkos::MemoryTraits>; \ + using YVector = Kokkos::View>; \ + using coefficient_type = typename YVector::non_const_value_type; \ + \ + static void spmv(const execution_space& exec, Handle* handle, \ + const char mode[], const coefficient_type& alpha, \ + const AMatrix& A, const XVector& x, \ + const coefficient_type& beta, const YVector& y) { \ + std::string label = "KokkosSparse::spmv[TPL_ONEMKL," + \ + Kokkos::ArithTraits::name() + "]"; \ + Kokkos::Profiling::pushRegion(label); \ + oneapi::mkl::transpose mkl_mode = mode_kk_to_onemkl(mode[0]); \ + spmv_onemkl(exec, handle, mkl_mode, alpha, A, x, beta, y); \ + Kokkos::Profiling::popRegion(); \ + } \ }; KOKKOSSPARSE_SPMV_ONEMKL(float, std::int32_t, @@ -829,12 +838,14 @@ KOKKOSSPARSE_SPMV_ONEMKL(float, std::int32_t, KOKKOSSPARSE_SPMV_ONEMKL(double, std::int32_t, Kokkos::Experimental::SYCLDeviceUSMSpace, KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) +/* 
KOKKOSSPARSE_SPMV_ONEMKL(Kokkos::complex, std::int32_t, - Kokkos::Experimental::SYCLDeviceUSMSpace, - KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) + Kokkos::Experimental::SYCLDeviceUSMSpace, + KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) KOKKOSSPARSE_SPMV_ONEMKL(Kokkos::complex, std::int32_t, - Kokkos::Experimental::SYCLDeviceUSMSpace, - KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) + Kokkos::Experimental::SYCLDeviceUSMSpace, + KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) +*/ KOKKOSSPARSE_SPMV_ONEMKL(float, std::int64_t, Kokkos::Experimental::SYCLDeviceUSMSpace, @@ -842,12 +853,14 @@ KOKKOSSPARSE_SPMV_ONEMKL(float, std::int64_t, KOKKOSSPARSE_SPMV_ONEMKL(double, std::int64_t, Kokkos::Experimental::SYCLDeviceUSMSpace, KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) +/* KOKKOSSPARSE_SPMV_ONEMKL(Kokkos::complex, std::int64_t, Kokkos::Experimental::SYCLDeviceUSMSpace, KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) KOKKOSSPARSE_SPMV_ONEMKL(Kokkos::complex, std::int64_t, Kokkos::Experimental::SYCLDeviceUSMSpace, KOKKOSKERNELS_IMPL_COMPILE_LIBRARY) +*/ #endif } // namespace Impl } // namespace KokkosSparse diff --git a/sparse/unit_test/Test_Sparse.hpp b/sparse/unit_test/Test_Sparse.hpp index 8ae06b598a..624cd86ff5 100644 --- a/sparse/unit_test/Test_Sparse.hpp +++ b/sparse/unit_test/Test_Sparse.hpp @@ -16,9 +16,7 @@ #ifndef TEST_SPARSE_HPP #define TEST_SPARSE_HPP -#if KOKKOS_VERSION >= 40099 #include "Test_Sparse_coo2crs.hpp" -#endif // KOKKOS_VERSION >= 40099 #include "Test_Sparse_crs2coo.hpp" #include "Test_Sparse_Controls.hpp" #include "Test_Sparse_CrsMatrix.hpp" diff --git a/sparse/unit_test/Test_Sparse_bspgemm.hpp b/sparse/unit_test/Test_Sparse_bspgemm.hpp index d3c3a6134f..58a2a18b8a 100644 --- a/sparse/unit_test/Test_Sparse_bspgemm.hpp +++ b/sparse/unit_test/Test_Sparse_bspgemm.hpp @@ -159,15 +159,6 @@ void test_bspgemm(lno_t blkDim, lno_t m, lno_t k, lno_t n, size_type nnz, return; } #endif // KOKKOSKERNELS_ENABLE_TPL_ARMPL -#if defined(KOKKOSKERNELS_ENABLE_TPL_CUSPARSE) && (CUSPARSE_VERSION < 11600) - { - std::cerr - << 
"TEST SKIPPED: See " - "https://github.com/kokkos/kokkos-kernels/issues/1965 for details." - << std::endl; - return; - } -#endif using namespace Test; // device::execution_space::initialize(); // device::execution_space::print_configuration(std::cout); diff --git a/sparse/unit_test/Test_Sparse_extractCrsDiagonalBlocks.hpp b/sparse/unit_test/Test_Sparse_extractCrsDiagonalBlocks.hpp index 327780dec3..28674ad353 100644 --- a/sparse/unit_test/Test_Sparse_extractCrsDiagonalBlocks.hpp +++ b/sparse/unit_test/Test_Sparse_extractCrsDiagonalBlocks.hpp @@ -15,6 +15,8 @@ //@HEADER #include "KokkosSparse_Utils.hpp" +#include "KokkosSparse_spmv.hpp" +#include "KokkosBlas1_nrm2.hpp" #include "KokkosKernels_TestUtils.hpp" namespace Test { @@ -31,6 +33,7 @@ void run_test_extract_diagonal_blocks(int nrows, int nblocks) { crsMat_t A; std::vector DiagBlks(nblocks); + std::vector DiagBlks_rcm(nblocks); if (nrows != 0) { // Generate test matrix @@ -84,6 +87,10 @@ void run_test_extract_diagonal_blocks(int nrows, int nblocks) { KokkosSparse::Impl::kk_extract_diagonal_blocks_crsmatrix_sequential(A, DiagBlks); + auto perm = + KokkosSparse::Impl::kk_extract_diagonal_blocks_crsmatrix_sequential( + A, DiagBlks_rcm, true); + // Checking lno_t numRows = 0; lno_t numCols = 0; @@ -125,6 +132,40 @@ void run_test_extract_diagonal_blocks(int nrows, int nblocks) { col_start += DiagBlks[i].numCols(); } EXPECT_TRUE(flag); + + // Checking RCM + if (!perm.empty()) { + scalar_t one = scalar_t(1.0); + scalar_t zero = scalar_t(0.0); + scalar_t mone = scalar_t(-1.0); + for (int i = 0; i < nblocks; i++) { + ValuesType In("In", DiagBlks[i].numRows()); + ValuesType Out("Out", DiagBlks[i].numRows()); + + ValuesType_hm h_Out = Kokkos::create_mirror_view(Out); + ValuesType_hm h_Out_tmp = Kokkos::create_mirror(Out); + + Kokkos::deep_copy(In, one); + + auto h_perm = + Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), perm[i]); + + KokkosSparse::spmv("N", one, DiagBlks_rcm[i], In, zero, Out); + + 
Kokkos::deep_copy(h_Out_tmp, Out); + for (lno_t ii = 0; ii < static_cast(DiagBlks[i].numRows()); + ii++) { + lno_t rcm_ii = h_perm(ii); + h_Out(ii) = h_Out_tmp(rcm_ii); + } + Kokkos::deep_copy(Out, h_Out); + + KokkosSparse::spmv("N", one, DiagBlks[i], In, mone, Out); + + double nrm_val = KokkosBlas::nrm2(Out); + EXPECT_LE(nrm_val, 1e-9); + } + } } } } // namespace Test @@ -136,9 +177,9 @@ void test_extract_diagonal_blocks() { Test::run_test_extract_diagonal_blocks( 0, s); Test::run_test_extract_diagonal_blocks( - 12, s); + 153, s); Test::run_test_extract_diagonal_blocks( - 123, s); + 1553, s); } } diff --git a/sparse/unit_test/Test_Sparse_gauss_seidel.hpp b/sparse/unit_test/Test_Sparse_gauss_seidel.hpp index 35fbcb44a4..48c7d41a91 100644 --- a/sparse/unit_test/Test_Sparse_gauss_seidel.hpp +++ b/sparse/unit_test/Test_Sparse_gauss_seidel.hpp @@ -356,7 +356,7 @@ void test_gauss_seidel_rank2(lno_t numRows, size_type nnz, lno_t bandwidth, // Zero out X before solving Kokkos::deep_copy(x_vector, zero); run_gauss_seidel(input_mat, GS_CLUSTER, x_vector, y_vector, symmetric, - apply_type, clusterSizes[csize], + apply_type, clusterSizes[csize], false, (ClusteringAlgorithm)algo); Kokkos::deep_copy(x_host, x_vector); for (lno_t i = 0; i < numVecs; i++) { @@ -752,17 +752,8 @@ void test_gauss_seidel_streams_rank1( } #endif // KOKKOS_ENABLE_OPENMP - std::vector instances; - if (nstreams == 1) - instances = Kokkos::Experimental::partition_space(execution_space(), 1); - else if (nstreams == 2) - instances = Kokkos::Experimental::partition_space(execution_space(), 1, 1); - else if (nstreams == 3) - instances = - Kokkos::Experimental::partition_space(execution_space(), 1, 1, 1); - else - instances = - Kokkos::Experimental::partition_space(execution_space(), 1, 1, 1, 1); + auto instances = Kokkos::Experimental::partition_space( + execution_space(), std::vector(nstreams, 1)); std::vector kh_v(nstreams); std::vector input_mat_v(nstreams); diff --git 
a/sparse/unit_test/Test_Sparse_gmres.hpp b/sparse/unit_test/Test_Sparse_gmres.hpp index 1990087526..ee78d27729 100644 --- a/sparse/unit_test/Test_Sparse_gmres.hpp +++ b/sparse/unit_test/Test_Sparse_gmres.hpp @@ -48,120 +48,163 @@ struct TolMeta { static constexpr float value = 1e-5; // Lower tolerance for floats }; +template ::value>::type* = nullptr> +AType get_A(int n, int diagDominance, int) { + using lno_t = typename Crs::ordinal_type; + typename Crs::non_const_size_type nnz = 10 * n; + auto A = + KokkosSparse::Impl::kk_generate_diagonally_dominant_sparse_matrix( + n, n, nnz, 0, lno_t(0.01 * n), diagDominance); + KokkosSparse::sort_crs_matrix(A); + + return A; +} + +template ::value>::type* = nullptr> +AType get_A(int n, int diagDominance, int block_size) { + using lno_t = typename Crs::ordinal_type; + typename Crs::non_const_size_type nnz = 10 * n; + auto A_unblocked = + KokkosSparse::Impl::kk_generate_diagonally_dominant_sparse_matrix( + n, n, nnz, 0, lno_t(0.01 * n), diagDominance); + KokkosSparse::sort_crs_matrix(A_unblocked); + + // Convert to BSR + AType A(A_unblocked, block_size); + + return A; +} + template -void run_test_gmres() { - using exe_space = typename device::execution_space; - using mem_space = typename device::memory_space; - using sp_matrix_type = - KokkosSparse::CrsMatrix; +struct GmresTest { + using RowMapType = Kokkos::View; + using EntriesType = Kokkos::View; + using ValuesType = Kokkos::View; + using AT = Kokkos::ArithTraits; + using exe_space = typename device::execution_space; + using mem_space = typename device::memory_space; + + using Crs = CrsMatrix; + using Bsr = BsrMatrix; + using KernelHandle = KokkosKernels::Experimental::KokkosKernelsHandle< size_type, lno_t, scalar_t, exe_space, mem_space, mem_space>; using float_t = typename Kokkos::ArithTraits::mag_type; - // Create a diagonally dominant sparse matrix to test: - constexpr auto n = 5000; - constexpr auto m = 15; - constexpr auto tol = TolMeta::value; - constexpr auto numRows 
= n; - constexpr auto numCols = n; - constexpr auto diagDominance = 1; - constexpr bool verbose = false; - - typename sp_matrix_type::non_const_size_type nnz = 10 * numRows; - auto A = KokkosSparse::Impl::kk_generate_diagonally_dominant_sparse_matrix< - sp_matrix_type>(numRows, numCols, nnz, 0, lno_t(0.01 * numRows), - diagDominance); - - // Make kernel handles - KernelHandle kh; - kh.create_gmres_handle(m, tol); - auto gmres_handle = kh.get_gmres_handle(); - using GMRESHandle = - typename std::remove_reference::type; - using ViewVectorType = typename GMRESHandle::nnz_value_view_t; - - // Set initial vectors: - ViewVectorType X("X", n); // Solution and initial guess - ViewVectorType Wj("Wj", n); // For checking residuals at end. - ViewVectorType B(Kokkos::view_alloc(Kokkos::WithoutInitializing, "B"), - n); // right-hand side vec - // Make rhs ones so that results are repeatable: - Kokkos::deep_copy(B, 1.0); - - gmres_handle->set_verbose(verbose); - - // Test CGS2 - { - gmres(&kh, A, B, X); - - // Double check residuals at end of solve: - float_t nrmB = KokkosBlas::nrm2(B); - KokkosSparse::spmv("N", 1.0, A, X, 0.0, Wj); // wj = Ax - KokkosBlas::axpy(-1.0, Wj, B); // b = b-Ax. - float_t endRes = KokkosBlas::nrm2(B) / nrmB; - - const auto conv_flag = gmres_handle->get_conv_flag_val(); - - EXPECT_LT(endRes, gmres_handle->get_tol()); - EXPECT_EQ(conv_flag, GMRESHandle::Flag::Conv); - } + template + static void run_test_gmres() { + using sp_matrix_type = std::conditional_t; + + // Create a diagonally dominant sparse matrix to test: + constexpr auto n = 5000; + constexpr auto m = 15; + constexpr auto tol = TolMeta::value; + constexpr auto diagDominance = 1; + constexpr bool verbose = false; + constexpr auto block_size = UseBlocks ? 
10 : 1; + + auto A = get_A(n, diagDominance, block_size); + + if (verbose) { + std::cout << "Running GMRES test with block_size=" << block_size + << std::endl; + } + + // Make kernel handles + KernelHandle kh; + kh.create_gmres_handle(m, tol); + auto gmres_handle = kh.get_gmres_handle(); + using GMRESHandle = + typename std::remove_reference::type; + using ViewVectorType = typename GMRESHandle::nnz_value_view_t; + + // Set initial vectors: + ViewVectorType X("X", n); // Solution and initial guess + ViewVectorType Wj("Wj", n); // For checking residuals at end. + ViewVectorType B(Kokkos::view_alloc(Kokkos::WithoutInitializing, "B"), + n); // right-hand side vec + // Make rhs ones so that results are repeatable: + Kokkos::deep_copy(B, 1.0); - // Test MGS - { - gmres_handle->reset_handle(m, tol); - gmres_handle->set_ortho(GMRESHandle::Ortho::MGS); gmres_handle->set_verbose(verbose); - // reset X for next gmres call - Kokkos::deep_copy(X, 0.0); + // Test CGS2 + { + gmres(&kh, A, B, X); - gmres(&kh, A, B, X); + // Double check residuals at end of solve: + float_t nrmB = KokkosBlas::nrm2(B); + KokkosSparse::spmv("N", 1.0, A, X, 0.0, Wj); // wj = Ax + KokkosBlas::axpy(-1.0, Wj, B); // b = b-Ax. + float_t endRes = KokkosBlas::nrm2(B) / nrmB; - // Double check residuals at end of solve: - float_t nrmB = KokkosBlas::nrm2(B); - KokkosSparse::spmv("N", 1.0, A, X, 0.0, Wj); // wj = Ax - KokkosBlas::axpy(-1.0, Wj, B); // b = b-Ax. 
- float_t endRes = KokkosBlas::nrm2(B) / nrmB; + const auto conv_flag = gmres_handle->get_conv_flag_val(); - const auto conv_flag = gmres_handle->get_conv_flag_val(); + EXPECT_LT(endRes, gmres_handle->get_tol()); + EXPECT_EQ(conv_flag, GMRESHandle::Flag::Conv); + } - EXPECT_LT(endRes, gmres_handle->get_tol()); - EXPECT_EQ(conv_flag, GMRESHandle::Flag::Conv); - } + // Test MGS + { + gmres_handle->reset_handle(m, tol); + gmres_handle->set_ortho(GMRESHandle::Ortho::MGS); + gmres_handle->set_verbose(verbose); - // Test GSS2 with simple preconditioner - { - gmres_handle->reset_handle(m, tol); - gmres_handle->set_verbose(verbose); + // reset X for next gmres call + Kokkos::deep_copy(X, 0.0); + + gmres(&kh, A, B, X); + + // Double check residuals at end of solve: + float_t nrmB = KokkosBlas::nrm2(B); + KokkosSparse::spmv("N", 1.0, A, X, 0.0, Wj); // wj = Ax + KokkosBlas::axpy(-1.0, Wj, B); // b = b-Ax. + float_t endRes = KokkosBlas::nrm2(B) / nrmB; + + const auto conv_flag = gmres_handle->get_conv_flag_val(); - // Make precond - KokkosSparse::Experimental::MatrixPrec myPrec(A); + EXPECT_LT(endRes, gmres_handle->get_tol()); + EXPECT_EQ(conv_flag, GMRESHandle::Flag::Conv); + } - // reset X for next gmres call - Kokkos::deep_copy(X, 0.0); + // Test GSS2 with simple preconditioner + { + gmres_handle->reset_handle(m, tol); + gmres_handle->set_verbose(verbose); - gmres(&kh, A, B, X, &myPrec); + // Make precond + KokkosSparse::Experimental::MatrixPrec myPrec(A); - // Double check residuals at end of solve: - float_t nrmB = KokkosBlas::nrm2(B); - KokkosSparse::spmv("N", 1.0, A, X, 0.0, Wj); // wj = Ax - KokkosBlas::axpy(-1.0, Wj, B); // b = b-Ax. 
- float_t endRes = KokkosBlas::nrm2(B) / nrmB; + // reset X for next gmres call + Kokkos::deep_copy(X, 0.0); - const auto conv_flag = gmres_handle->get_conv_flag_val(); + gmres(&kh, A, B, X, &myPrec); - EXPECT_LT(endRes, gmres_handle->get_tol()); - EXPECT_EQ(conv_flag, GMRESHandle::Flag::Conv); + // Double check residuals at end of solve: + float_t nrmB = KokkosBlas::nrm2(B); + KokkosSparse::spmv("N", 1.0, A, X, 0.0, Wj); // wj = Ax + KokkosBlas::axpy(-1.0, Wj, B); // b = b-Ax. + float_t endRes = KokkosBlas::nrm2(B) / nrmB; + + const auto conv_flag = gmres_handle->get_conv_flag_val(); + + EXPECT_LT(endRes, gmres_handle->get_tol()); + EXPECT_EQ(conv_flag, GMRESHandle::Flag::Conv); + } } -} +}; } // namespace Test template void test_gmres() { - Test::run_test_gmres(); + using TestStruct = Test::GmresTest; + TestStruct::template run_test_gmres(); + TestStruct::template run_test_gmres(); } #define KOKKOSKERNELS_EXECUTE_TEST(SCALAR, ORDINAL, OFFSET, DEVICE) \ diff --git a/sparse/unit_test/Test_Sparse_par_ilut.hpp b/sparse/unit_test/Test_Sparse_par_ilut.hpp index 4370ebe37e..cda09d0639 100644 --- a/sparse/unit_test/Test_Sparse_par_ilut.hpp +++ b/sparse/unit_test/Test_Sparse_par_ilut.hpp @@ -29,6 +29,8 @@ #include "KokkosSparse_LUPrec.hpp" #include "KokkosSparse_SortCrs.hpp" +#include "Test_vector_fixtures.hpp" + #include using namespace KokkosSparse; @@ -52,69 +54,6 @@ struct TolMeta { } // namespace ParIlut -template -std::vector> decompress_matrix( - Kokkos::View& row_map, - Kokkos::View& entries, - Kokkos::View& values) { - const size_type nrows = row_map.size() - 1; - std::vector> result; - result.resize(nrows); - for (auto& row : result) { - row.resize(nrows, 0.0); - } - - auto hrow_map = Kokkos::create_mirror_view(row_map); - auto hentries = Kokkos::create_mirror_view(entries); - auto hvalues = Kokkos::create_mirror_view(values); - Kokkos::deep_copy(hrow_map, row_map); - Kokkos::deep_copy(hentries, entries); - Kokkos::deep_copy(hvalues, values); - - for (size_type 
row_idx = 0; row_idx < nrows; ++row_idx) { - const size_type row_nnz_begin = hrow_map(row_idx); - const size_type row_nnz_end = hrow_map(row_idx + 1); - for (size_type row_nnz = row_nnz_begin; row_nnz < row_nnz_end; ++row_nnz) { - const lno_t col_idx = hentries(row_nnz); - const scalar_t value = hvalues(row_nnz); - result[row_idx][col_idx] = value; - } - } - - return result; -} - -template -void check_matrix(const std::string& name, - Kokkos::View& row_map, - Kokkos::View& entries, - Kokkos::View& values, - const std::vector>& expected) { - const auto decompressed_mtx = decompress_matrix(row_map, entries, values); - - const size_type nrows = row_map.size() - 1; - for (size_type row_idx = 0; row_idx < nrows; ++row_idx) { - for (size_type col_idx = 0; col_idx < nrows; ++col_idx) { - EXPECT_NEAR(expected[row_idx][col_idx], - decompressed_mtx[row_idx][col_idx], 0.01) - << "Failed check is: " << name << "[" << row_idx << "][" << col_idx - << "]"; - } - } -} - -template -void print_matrix(const std::vector>& matrix) { - for (const auto& row : matrix) { - for (const auto& item : row) { - std::printf("%.2f ", item); - } - std::cout << std::endl; - } -} - template void run_test_par_ilut() { @@ -131,47 +70,14 @@ void run_test_par_ilut() { {0.5, -3., 6., 0.}, {0.2, -0.5, -9., 0.}}; - const scalar_t ZERO = scalar_t(0); - - const size_type nrows = A.size(); - - // Count A nnz's - size_type nnz = 0; - for (size_type row_idx = 0; row_idx < nrows; ++row_idx) { - for (size_type col_idx = 0; col_idx < nrows; ++col_idx) { - if (A[row_idx][col_idx] != ZERO) { - ++nnz; - } - } - } - // Allocate device CRS views for A - RowMapType row_map("row_map", nrows + 1); - EntriesType entries("entries", nnz); - ValuesType values("values", nnz); - - // Create host mirror views for CRS A - auto hrow_map = Kokkos::create_mirror_view(row_map); - auto hentries = Kokkos::create_mirror_view(entries); - auto hvalues = Kokkos::create_mirror_view(values); + RowMapType row_map("row_map", 0); + EntriesType 
entries("entries", 0); + ValuesType values("values", 0); - // Compress A into CRS (host views) - size_type curr_nnz = 0; - for (size_type row_idx = 0; row_idx < nrows; ++row_idx) { - for (size_type col_idx = 0; col_idx < nrows; ++col_idx) { - if (A[row_idx][col_idx] != ZERO) { - hentries(curr_nnz) = col_idx; - hvalues(curr_nnz) = A[row_idx][col_idx]; - ++curr_nnz; - } - hrow_map(row_idx + 1) = curr_nnz; - } - } + compress_matrix(row_map, entries, values, A); - // Copy host A CRS views to device A CRS views - Kokkos::deep_copy(row_map, hrow_map); - Kokkos::deep_copy(entries, hentries); - Kokkos::deep_copy(values, hvalues); + const size_type nrows = A.size(); // Make kernel handle KernelHandle kh; diff --git a/sparse/unit_test/Test_Sparse_spadd.hpp b/sparse/unit_test/Test_Sparse_spadd.hpp index 05ff97bb3a..3156801dbd 100644 --- a/sparse/unit_test/Test_Sparse_spadd.hpp +++ b/sparse/unit_test/Test_Sparse_spadd.hpp @@ -32,7 +32,11 @@ typedef Kokkos::complex kokkos_complex_double; typedef Kokkos::complex kokkos_complex_float; -// Create a random square matrix for testing mat-mat addition kernels +// Create a random nrows by ncols matrix for testing mat-mat addition kernels. +// minNNZ, maxNNZ: min and max number of nonzeros in any row. +// maxNNZ > ncols will result in duplicated entries in a row, otherwise entries +// in a row are unique. 
+// sortRows: whether to sort columns in a row template crsMat_t randomMatrix(ordinal_type nrows, ordinal_type ncols, ordinal_type minNNZ, ordinal_type maxNNZ, bool sortRows) { @@ -117,7 +121,9 @@ void test_spadd(lno_t numRows, lno_t numCols, size_type minNNZ, srand((numRows << 1) ^ numCols); KernelHandle handle; - handle.create_spadd_handle(sortRows); + // If maxNNZ <= numCols, the generated A, B have unique column indices in each + // row + handle.create_spadd_handle(sortRows, static_cast(maxNNZ) <= numCols); crsMat_t A = randomMatrix(numRows, numCols, minNNZ, maxNNZ, sortRows); crsMat_t B = @@ -129,9 +135,10 @@ void test_spadd(lno_t numRows, lno_t numCols, size_type minNNZ, // initialized Kokkos::deep_copy(c_row_map, (size_type)5); auto addHandle = handle.get_spadd_handle(); - KokkosSparse::Experimental::spadd_symbolic(&handle, A.graph.row_map, - A.graph.entries, B.graph.row_map, - B.graph.entries, c_row_map); + typename Device::execution_space exec{}; + KokkosSparse::Experimental::spadd_symbolic( + exec, &handle, numRows, numCols, A.graph.row_map, A.graph.entries, + B.graph.row_map, B.graph.entries, c_row_map); size_type c_nnz = addHandle->get_c_nnz(); // Fill values, entries with incorrect incorret values_type c_values( @@ -140,9 +147,9 @@ void test_spadd(lno_t numRows, lno_t numCols, size_type minNNZ, entries_type c_entries("C entries", c_nnz); Kokkos::deep_copy(c_entries, (lno_t)5); KokkosSparse::Experimental::spadd_numeric( - &handle, A.graph.row_map, A.graph.entries, A.values, KAT::one(), - B.graph.row_map, B.graph.entries, B.values, KAT::one(), c_row_map, - c_entries, c_values); + exec, &handle, numRows, numCols, A.graph.row_map, A.graph.entries, + A.values, KAT::one(), B.graph.row_map, B.graph.entries, B.values, + KAT::one(), c_row_map, c_entries, c_values); // done with handle // create C using CRS arrays crsMat_t C("C", numRows, numCols, c_nnz, c_values, c_row_map, c_entries); diff --git a/sparse/unit_test/Test_Sparse_spgemm.hpp 
b/sparse/unit_test/Test_Sparse_spgemm.hpp index 7e655d4c0c..bd1e68c370 100644 --- a/sparse/unit_test/Test_Sparse_spgemm.hpp +++ b/sparse/unit_test/Test_Sparse_spgemm.hpp @@ -486,16 +486,6 @@ void test_issue402() { template void test_issue1738() { -#if defined(KOKKOSKERNELS_ENABLE_TPL_CUSPARSE) && (CUDA_VERSION >= 11000) && \ - (CUDA_VERSION < 11040) - { - std::cerr - << "TEST SKIPPED: See " - "https://github.com/kokkos/kokkos-kernels/issues/1777 for details." - << std::endl; - return; - } -#endif // KOKKOSKERNELS_ENABLE_TPL_ARMPL // Make sure that std::invalid_argument is thrown if you: // - call numeric where an input matrix's entries have changed. // - try to reuse an spgemm handle by calling symbolic with new input diff --git a/sparse/unit_test/Test_Sparse_spiluk.hpp b/sparse/unit_test/Test_Sparse_spiluk.hpp index 77cdb1ede1..2a8398ed46 100644 --- a/sparse/unit_test/Test_Sparse_spiluk.hpp +++ b/sparse/unit_test/Test_Sparse_spiluk.hpp @@ -26,161 +26,139 @@ #include "KokkosBlas1_nrm2.hpp" #include "KokkosSparse_spmv.hpp" #include "KokkosSparse_spiluk.hpp" +#include "KokkosSparse_crs_to_bsr_impl.hpp" +#include "KokkosSparse_bsr_to_crs_impl.hpp" +#include "KokkosSparse_LUPrec.hpp" +#include "KokkosSparse_gmres.hpp" -#include +#include "Test_vector_fixtures.hpp" + +#include +#include using namespace KokkosSparse; using namespace KokkosSparse::Experimental; using namespace KokkosKernels; using namespace KokkosKernels::Experimental; -// #ifndef kokkos_complex_double -// #define kokkos_complex_double Kokkos::complex -// #define kokkos_complex_float Kokkos::complex -// #endif +using kokkos_complex_double = Kokkos::complex; +using kokkos_complex_float = Kokkos::complex; + +// Comment this out to do focussed debugging +#define TEST_SPILUK_FULL_CHECKS -typedef Kokkos::complex kokkos_complex_double; -typedef Kokkos::complex kokkos_complex_float; +// Test verbosity level. 
0 = none, 1 = print residuals, 2 = print L,U +#define TEST_SPILUK_VERBOSE_LEVEL 0 + +// #define TEST_SPILUK_TINY_TEST namespace Test { -template -void run_test_spiluk() { - typedef Kokkos::View RowMapType; - typedef Kokkos::View EntriesType; - typedef Kokkos::View ValuesType; - typedef Kokkos::ArithTraits AT; - - const size_type nrows = 9; - const size_type nnz = 21; - - RowMapType row_map("row_map", nrows + 1); - EntriesType entries("entries", nnz); - ValuesType values("values", nnz); - - auto hrow_map = Kokkos::create_mirror_view(row_map); - auto hentries = Kokkos::create_mirror_view(entries); - auto hvalues = Kokkos::create_mirror_view(values); - - scalar_t ZERO = scalar_t(0); - scalar_t ONE = scalar_t(1); - scalar_t MONE = scalar_t(-1); - - hrow_map(0) = 0; - hrow_map(1) = 3; - hrow_map(2) = 5; - hrow_map(3) = 6; - hrow_map(4) = 9; - hrow_map(5) = 11; - hrow_map(6) = 13; - hrow_map(7) = 15; - hrow_map(8) = 18; - hrow_map(9) = nnz; - - hentries(0) = 0; - hentries(1) = 2; - hentries(2) = 5; - hentries(3) = 1; - hentries(4) = 6; - hentries(5) = 2; - hentries(6) = 0; - hentries(7) = 3; - hentries(8) = 4; - hentries(9) = 0; - hentries(10) = 4; - hentries(11) = 1; - hentries(12) = 5; - hentries(13) = 2; - hentries(14) = 6; - hentries(15) = 3; - hentries(16) = 4; - hentries(17) = 7; - hentries(18) = 3; - hentries(19) = 4; - hentries(20) = 8; - - hvalues(0) = 10; - hvalues(1) = 0.3; - hvalues(2) = 0.6; - hvalues(3) = 11; - hvalues(4) = 0.7; - hvalues(5) = 12; - hvalues(6) = 5; - hvalues(7) = 13; - hvalues(8) = 1; - hvalues(9) = 4; - hvalues(10) = 14; - hvalues(11) = 3; - hvalues(12) = 15; - hvalues(13) = 7; - hvalues(14) = 16; - hvalues(15) = 6; - hvalues(16) = 5; - hvalues(17) = 17; - hvalues(18) = 2; - hvalues(19) = 2.5; - hvalues(20) = 18; - - Kokkos::deep_copy(row_map, hrow_map); - Kokkos::deep_copy(entries, hentries); - Kokkos::deep_copy(values, hvalues); - - typedef KokkosKernels::Experimental::KokkosKernelsHandle< - size_type, lno_t, scalar_t, typename 
device::execution_space, - typename device::memory_space, typename device::memory_space> - KernelHandle; - - KernelHandle kh; - - // SPILUKAlgorithm::SEQLVLSCHD_RP - { - kh.create_spiluk_handle(SPILUKAlgorithm::SEQLVLSCHD_RP, nrows, 4 * nrows, - 4 * nrows); +#ifdef TEST_SPILUK_TINY_TEST +template +std::vector> get_fixture() { + std::vector> A = {{10.00, 1.00, 0.00, 0.00}, + {0.00, 11.00, 0.00, 0.00}, + {0.00, 2.00, 12.00, 0.00}, + {5.00, 0.00, 3.00, 13.00}}; + return A; +} +#else +template +std::vector> get_fixture() { + std::vector> A = { + {10.00, 0.00, 0.30, 0.00, 0.00, 0.60, 0.00, 0.00, 0.00}, + {0.00, 11.00, 0.00, 0.00, 0.00, 0.00, 0.70, 0.00, 0.00}, + {0.00, 0.00, 12.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00}, + {5.00, 0.00, 0.00, 13.00, 1.00, 0.00, 0.00, 0.00, 0.00}, + {4.00, 0.00, 0.00, 0.00, 14.00, 0.00, 0.00, 0.00, 0.00}, + {0.00, 3.00, 0.00, 0.00, 0.00, 15.00, 0.00, 0.00, 0.00}, + {0.00, 0.00, 7.00, 0.00, 0.00, 0.00, 16.00, 0.00, 0.00}, + {0.00, 0.00, 0.00, 6.00, 5.00, 0.00, 0.00, 17.00, 0.00}, + {0.00, 0.00, 0.00, 2.00, 2.50, 0.00, 0.00, 0.00, 18.00}}; + return A; +} +#endif - auto spiluk_handle = kh.get_spiluk_handle(); +template < + typename MatrixType, typename CRS, + typename std::enable_if::value>::type* = nullptr> +MatrixType get_A(CRS A_unblocked, const size_t) { + return A_unblocked; +} - // Allocate L and U as outputs - RowMapType L_row_map("L_row_map", nrows + 1); - EntriesType L_entries("L_entries", spiluk_handle->get_nnzL()); - ValuesType L_values("L_values", spiluk_handle->get_nnzL()); - RowMapType U_row_map("U_row_map", nrows + 1); - EntriesType U_entries("U_entries", spiluk_handle->get_nnzU()); - ValuesType U_values("U_values", spiluk_handle->get_nnzU()); +template < + typename MatrixType, typename CRS, + typename std::enable_if::value>::type* = nullptr> +MatrixType get_A(CRS A_unblocked, const size_t block_size) { + // Convert to BSR + MatrixType A(A_unblocked, block_size); - typename KernelHandle::const_nnz_lno_t fill_lev = 2; + return A; 
+} - spiluk_symbolic(&kh, fill_lev, row_map, entries, L_row_map, L_entries, - U_row_map, U_entries); +template < + typename MatrixType, typename RowMapType, typename EntriesType, + typename ValuesType, + typename std::enable_if::value>::type* = nullptr> +MatrixType make_matrix(const char* name, const RowMapType& row_map, + const EntriesType& entries, const ValuesType& values, + const size_t) { + const auto nrows = row_map.extent(0) - 1; + return MatrixType(name, nrows, nrows, values.extent(0), values, row_map, + entries); +} - Kokkos::fence(); +template < + typename MatrixType, typename RowMapType, typename EntriesType, + typename ValuesType, + typename std::enable_if::value>::type* = nullptr> +MatrixType make_matrix(const char* name, const RowMapType& row_map, + const EntriesType& entries, const ValuesType& values, + const size_t block_size) { + const auto nrows = row_map.extent(0) - 1; + return MatrixType(name, nrows, nrows, values.extent(0), values, row_map, + entries, block_size); +} - Kokkos::resize(L_entries, spiluk_handle->get_nnzL()); - Kokkos::resize(L_values, spiluk_handle->get_nnzL()); - Kokkos::resize(U_entries, spiluk_handle->get_nnzU()); - Kokkos::resize(U_values, spiluk_handle->get_nnzU()); +static constexpr double EPS = 1e-7; - spiluk_handle->print_algorithm(); - spiluk_numeric(&kh, fill_lev, row_map, entries, values, L_row_map, - L_entries, L_values, U_row_map, U_entries, U_values); +template +struct SpilukTest { + using RowMapType = Kokkos::View; + using EntriesType = Kokkos::View; + using ValuesType = Kokkos::View; + using AT = Kokkos::ArithTraits; - Kokkos::fence(); + using RowMapType_hostmirror = typename RowMapType::HostMirror; + using EntriesType_hostmirror = typename EntriesType::HostMirror; + using ValuesType_hostmirror = typename ValuesType::HostMirror; + using execution_space = typename device::execution_space; + using memory_space = typename device::memory_space; + using range_policy = Kokkos::RangePolicy; - // Checking - typedef 
CrsMatrix crsMat_t; - crsMat_t A("A_Mtx", nrows, nrows, nnz, values, row_map, entries); - crsMat_t L("L_Mtx", nrows, nrows, spiluk_handle->get_nnzL(), L_values, - L_row_map, L_entries); - crsMat_t U("U_Mtx", nrows, nrows, spiluk_handle->get_nnzU(), U_values, - U_row_map, U_entries); + using KernelHandle = KokkosKernels::Experimental::KokkosKernelsHandle< + size_type, lno_t, scalar_t, execution_space, memory_space, memory_space>; + + using Crs = CrsMatrix; + using Bsr = BsrMatrix; + + template + static typename AT::mag_type check_result_impl( + const AType& A, const LType& L, const UType& U, const size_type nrows, + const size_type block_size = 1) { + const scalar_t ZERO = scalar_t(0); + const scalar_t ONE = scalar_t(1); + const scalar_t MONE = scalar_t(-1); // Create a reference view e set to all 1's - ValuesType e_one("e_one", nrows); - Kokkos::deep_copy(e_one, 1.0); + ValuesType e_one("e_one", nrows * block_size); + Kokkos::deep_copy(e_one, ONE); // Create two views for spmv results - ValuesType bb("bb", nrows); - ValuesType bb_tmp("bb_tmp", nrows); + ValuesType bb("bb", nrows * block_size); + ValuesType bb_tmp("bb_tmp", nrows * block_size); // Compute norm2(L*U*e_one - A*e_one)/norm2(A*e_one) KokkosSparse::spmv("N", ONE, A, e_one, ZERO, bb); @@ -192,27 +170,111 @@ void run_test_spiluk() { typename AT::mag_type diff_nrm = KokkosBlas::nrm2(bb); - EXPECT_TRUE((diff_nrm / bb_nrm) < 1e-4); + return diff_nrm / bb_nrm; + } - kh.destroy_spiluk_handle(); + static bool is_triangular(const RowMapType& drow_map, + const EntriesType& dentries, bool check_lower) { + const size_type nrows = drow_map.extent(0) - 1; + + auto row_map = Kokkos::create_mirror_view(drow_map); + auto entries = Kokkos::create_mirror_view(dentries); + Kokkos::deep_copy(row_map, drow_map); + Kokkos::deep_copy(entries, dentries); + + for (size_type row = 0; row < nrows; ++row) { + const size_type row_nnz_begin = row_map(row); + const size_type row_nnz_end = row_map(row + 1); + for (size_type nnz = 
row_nnz_begin; nnz < row_nnz_end; ++nnz) { + const size_type col = entries(nnz); + if (col > row && check_lower) { + return false; + } else if (col < row && !check_lower) { + return false; + } + } + } + return true; } - // SPILUKAlgorithm::SEQLVLSCHD_TP1 - { - kh.create_spiluk_handle(SPILUKAlgorithm::SEQLVLSCHD_TP1, nrows, 4 * nrows, - 4 * nrows); + template + static void check_result(const RowMapType& row_map, + const EntriesType& entries, const ValuesType& values, + const RowMapType& L_row_map, + const EntriesType& L_entries, + const ValuesType& L_values, + const RowMapType& U_row_map, + const EntriesType& U_entries, + const ValuesType& U_values, const lno_t fill_lev, + const size_type block_size = 1) { + using sp_matrix_type = std::conditional_t; + + KK_REQUIRE(UseBlocks || (block_size == 1)); + + // Checking + const auto nrows = row_map.extent(0) - 1; + auto A = make_matrix("A_Mtx", row_map, entries, values, + block_size); + auto L = make_matrix("L_Mtx", L_row_map, L_entries, + L_values, block_size); + auto U = make_matrix("U_Mtx", U_row_map, U_entries, + U_values, block_size); + + EXPECT_TRUE(is_triangular(L_row_map, L_entries, true)); + EXPECT_TRUE(is_triangular(U_row_map, U_entries, false)); + + const auto result = check_result_impl(A, L, U, nrows, block_size); + if (TEST_SPILUK_VERBOSE_LEVEL > 0) { + std::cout << "For nrows=" << nrows << ", fill_level=" << fill_lev; + if (UseBlocks) { + std::cout << ", block_size=" << block_size; + } else { + std::cout << ", unblocked"; + } + std::cout << " had residual: " << result << std::endl; + } + if (TEST_SPILUK_VERBOSE_LEVEL > 1) { + std::cout << "L result" << std::endl; + print_matrix( + decompress_matrix(L_row_map, L_entries, L_values, block_size)); + std::cout << "U result" << std::endl; + print_matrix( + decompress_matrix(U_row_map, U_entries, U_values, block_size)); + } + + if (fill_lev > 1) { + if (UseBlocks) { + EXPECT_LT(result, 1e-2); + } else { + EXPECT_LT(result, 1e-4); + } + } + } + + template + static 
std::tuple + run_and_check_spiluk(KernelHandle& kh, const RowMapType& row_map, + const EntriesType& entries, const ValuesType& values, + SPILUKAlgorithm alg, const lno_t fill_lev, + const size_type block_size = 1) { + KK_REQUIRE(UseBlocks || (block_size == 1)); + + const size_type block_items = block_size * block_size; + const size_type nrows = row_map.extent(0) - 1; + kh.create_spiluk_handle(alg, nrows, 40 * nrows, 40 * nrows, + !UseBlocks ? 0 : block_size); auto spiluk_handle = kh.get_spiluk_handle(); + if (TeamSize != -1) { + spiluk_handle->set_team_size(TeamSize); + } // Allocate L and U as outputs RowMapType L_row_map("L_row_map", nrows + 1); EntriesType L_entries("L_entries", spiluk_handle->get_nnzL()); - ValuesType L_values("L_values", spiluk_handle->get_nnzL()); RowMapType U_row_map("U_row_map", nrows + 1); EntriesType U_entries("U_entries", spiluk_handle->get_nnzU()); - ValuesType U_values("U_values", spiluk_handle->get_nnzU()); - - typename KernelHandle::const_nnz_lno_t fill_lev = 2; spiluk_symbolic(&kh, fill_lev, row_map, entries, L_row_map, L_entries, U_row_map, U_entries); @@ -220,292 +282,609 @@ void run_test_spiluk() { Kokkos::fence(); Kokkos::resize(L_entries, spiluk_handle->get_nnzL()); - Kokkos::resize(L_values, spiluk_handle->get_nnzL()); Kokkos::resize(U_entries, spiluk_handle->get_nnzU()); - Kokkos::resize(U_values, spiluk_handle->get_nnzU()); + ValuesType L_values("L_values", spiluk_handle->get_nnzL() * block_items); + ValuesType U_values("U_values", spiluk_handle->get_nnzU() * block_items); - spiluk_handle->print_algorithm(); spiluk_numeric(&kh, fill_lev, row_map, entries, values, L_row_map, L_entries, L_values, U_row_map, U_entries, U_values); Kokkos::fence(); - // Checking - typedef CrsMatrix crsMat_t; - crsMat_t A("A_Mtx", nrows, nrows, nnz, values, row_map, entries); - crsMat_t L("L_Mtx", nrows, nrows, spiluk_handle->get_nnzL(), L_values, - L_row_map, L_entries); - crsMat_t U("U_Mtx", nrows, nrows, spiluk_handle->get_nnzU(), U_values, - 
U_row_map, U_entries); + check_result(row_map, entries, values, L_row_map, L_entries, + L_values, U_row_map, U_entries, U_values, fill_lev, + block_size); - // Create a reference view e set to all 1's - ValuesType e_one("e_one", nrows); - Kokkos::deep_copy(e_one, 1.0); + kh.destroy_spiluk_handle(); - // Create two views for spmv results - ValuesType bb("bb", nrows); - ValuesType bb_tmp("bb_tmp", nrows); +#ifdef TEST_SPILUK_FULL_CHECKS + // If block_size is 1, results should exactly match unblocked results + if (block_size == 1 && UseBlocks) { + const auto [L_row_map_u, L_entries_u, L_values_u, U_row_map_u, + U_entries_u, U_values_u] = + run_and_check_spiluk(kh, row_map, entries, values, + alg, fill_lev); + + EXPECT_NEAR_KK_1DVIEW(L_row_map, L_row_map_u, EPS); + EXPECT_NEAR_KK_1DVIEW(L_entries, L_entries_u, EPS); + EXPECT_NEAR_KK_1DVIEW(L_values, L_values_u, EPS); + EXPECT_NEAR_KK_1DVIEW(U_row_map, U_row_map_u, EPS); + EXPECT_NEAR_KK_1DVIEW(U_entries, U_entries_u, EPS); + EXPECT_NEAR_KK_1DVIEW(U_values, U_values_u, EPS); + } - // Compute norm2(L*U*e_one - A*e_one)/norm2(A*e_one) - KokkosSparse::spmv("N", ONE, A, e_one, ZERO, bb); + // Check that team size = 1 produces same result + if (TeamSize != 1) { + const auto [L_row_map_ts1, L_entries_ts1, L_values_ts1, U_row_map_ts1, + U_entries_ts1, U_values_ts1] = + run_and_check_spiluk(kh, row_map, entries, values, alg, + fill_lev, block_size); + + EXPECT_NEAR_KK_1DVIEW(L_row_map, L_row_map_ts1, EPS); + EXPECT_NEAR_KK_1DVIEW(L_entries, L_entries_ts1, EPS); + EXPECT_NEAR_KK_1DVIEW(L_values, L_values_ts1, EPS); + EXPECT_NEAR_KK_1DVIEW(U_row_map, U_row_map_ts1, EPS); + EXPECT_NEAR_KK_1DVIEW(U_entries, U_entries_ts1, EPS); + EXPECT_NEAR_KK_1DVIEW(U_values, U_values_ts1, EPS); + } +#endif - typename AT::mag_type bb_nrm = KokkosBlas::nrm2(bb); + return std::make_tuple(L_row_map, L_entries, L_values, U_row_map, U_entries, + U_values); + } - KokkosSparse::spmv("N", ONE, U, e_one, ZERO, bb_tmp); - KokkosSparse::spmv("N", ONE, L, 
bb_tmp, MONE, bb); + static void run_test_spiluk() { + std::vector> A = get_fixture(); - typename AT::mag_type diff_nrm = KokkosBlas::nrm2(bb); + if (TEST_SPILUK_VERBOSE_LEVEL > 1) { + std::cout << "A input" << std::endl; + print_matrix(A); + } - EXPECT_TRUE((diff_nrm / bb_nrm) < 1e-4); + RowMapType row_map; + EntriesType entries; + ValuesType values; - kh.destroy_spiluk_handle(); + compress_matrix(row_map, entries, values, A); + + const lno_t fill_lev = 2; + + KernelHandle kh; + + run_and_check_spiluk(kh, row_map, entries, values, + SPILUKAlgorithm::SEQLVLSCHD_TP1, fill_lev); } -} -template -void run_test_spiluk_streams(int test_algo, int nstreams) { - using RowMapType = Kokkos::View; - using EntriesType = Kokkos::View; - using ValuesType = Kokkos::View; - using RowMapType_hostmirror = typename RowMapType::HostMirror; - using EntriesType_hostmirror = typename EntriesType::HostMirror; - using ValuesType_hostmirror = typename ValuesType::HostMirror; - using execution_space = typename device::execution_space; - using memory_space = typename device::memory_space; - using KernelHandle = KokkosKernels::Experimental::KokkosKernelsHandle< - size_type, lno_t, scalar_t, execution_space, memory_space, memory_space>; - using crsMat_t = CrsMatrix; - using AT = Kokkos::ArithTraits; + static void run_test_spiluk_blocks() { + std::vector> A = get_fixture(); - // Workaround for OpenMP: skip tests if concurrency < nstreams because of - // not enough resource to partition - bool run_streams_test = true; -#ifdef KOKKOS_ENABLE_OPENMP - if (std::is_same::value) { - int exec_concurrency = execution_space().concurrency(); - if (exec_concurrency < nstreams) { - run_streams_test = false; - std::cout << " Skip stream test: concurrency = " << exec_concurrency - << std::endl; + if (TEST_SPILUK_VERBOSE_LEVEL > 1) { + std::cout << "A input" << std::endl; + print_matrix(A); + } + + RowMapType row_map, brow_map; + EntriesType entries, bentries; + ValuesType values, bvalues; + + 
compress_matrix(row_map, entries, values, A); + + const size_type nrows = A.size(); + const size_type nnz = values.extent(0); + const lno_t fill_lev = 2; + const size_type block_size = nrows % 2 == 0 ? 2 : 3; + ASSERT_EQ(nrows % block_size, 0); + + KernelHandle kh; + + Crs crs("crs for block spiluk test", nrows, nrows, nnz, values, row_map, + entries); + + std::vector block_sizes = {1, block_size}; + + for (auto block_size_itr : block_sizes) { + Bsr bsr(crs, block_size_itr); + + // Pull out views from BSR + Kokkos::resize(brow_map, bsr.graph.row_map.extent(0)); + Kokkos::resize(bentries, bsr.graph.entries.extent(0)); + Kokkos::resize(bvalues, bsr.values.extent(0)); + Kokkos::deep_copy(brow_map, bsr.graph.row_map); + Kokkos::deep_copy(bentries, bsr.graph.entries); + Kokkos::deep_copy(bvalues, bsr.values); + + run_and_check_spiluk(kh, brow_map, bentries, bvalues, + SPILUKAlgorithm::SEQLVLSCHD_TP1, fill_lev, + block_size_itr); + } + } + + static void run_test_spiluk_scale() { + // Create a diagonally dominant sparse matrix to test: + constexpr auto nrows = 5000; + constexpr auto diagDominance = 2; + + size_type nnz = 10 * nrows; + auto A = + KokkosSparse::Impl::kk_generate_diagonally_dominant_sparse_matrix( + nrows, nrows, nnz, 0, lno_t(0.01 * nrows), diagDominance); + + KokkosSparse::sort_crs_matrix(A); + + // Pull out views from CRS + RowMapType row_map("row_map", A.graph.row_map.extent(0)); + EntriesType entries("entries", A.graph.entries.extent(0)); + ValuesType values("values", A.values.extent(0)); + Kokkos::deep_copy(row_map, A.graph.row_map); + Kokkos::deep_copy(entries, A.graph.entries); + Kokkos::deep_copy(values, A.values); + + for (lno_t fill_lev = 0; fill_lev < 4; ++fill_lev) { + KernelHandle kh; + + run_and_check_spiluk(kh, row_map, entries, values, + SPILUKAlgorithm::SEQLVLSCHD_TP1, fill_lev); + } + } + + static void run_test_spiluk_scale_blocks() { + // Create a diagonally dominant sparse matrix to test: + constexpr auto nrows = 5000; + constexpr auto 
diagDominance = 2; + + RowMapType brow_map; + EntriesType bentries; + ValuesType bvalues; + + // const size_type block_size = 10; + + size_type nnz = 10 * nrows; + auto A = + KokkosSparse::Impl::kk_generate_diagonally_dominant_sparse_matrix( + nrows, nrows, nnz, 0, lno_t(0.01 * nrows), diagDominance); + + KokkosSparse::sort_crs_matrix(A); + + std::vector block_sizes = {1, 2, 4, 10}; + + for (auto block_size : block_sizes) { + // Convert to BSR + Bsr bsr(A, block_size); + + // Pull out views from BSR + Kokkos::resize(brow_map, bsr.graph.row_map.extent(0)); + Kokkos::resize(bentries, bsr.graph.entries.extent(0)); + Kokkos::resize(bvalues, bsr.values.extent(0)); + Kokkos::deep_copy(brow_map, bsr.graph.row_map); + Kokkos::deep_copy(bentries, bsr.graph.entries); + Kokkos::deep_copy(bvalues, bsr.values); + + for (lno_t fill_lev = 0; fill_lev < 4; ++fill_lev) { + KernelHandle kh; + + run_and_check_spiluk(kh, brow_map, bentries, bvalues, + SPILUKAlgorithm::SEQLVLSCHD_TP1, fill_lev, + block_size); + } } } + + static void run_test_spiluk_streams(SPILUKAlgorithm test_algo, int nstreams) { + // Workaround for OpenMP: skip tests if concurrency < nstreams because of + // not enough resource to partition + bool run_streams_test = true; +#ifdef KOKKOS_ENABLE_OPENMP + if (std::is_same::value) { + int exec_concurrency = execution_space().concurrency(); + if (exec_concurrency < nstreams) { + run_streams_test = false; + std::cout << " Skip stream test: concurrency = " << exec_concurrency + << std::endl; + } + } #endif - if (!run_streams_test) return; - - const size_type nrows = 9; - const size_type nnz = 21; - - std::vector instances; - if (nstreams == 1) - instances = Kokkos::Experimental::partition_space(execution_space(), 1); - else if (nstreams == 2) - instances = Kokkos::Experimental::partition_space(execution_space(), 1, 1); - else if (nstreams == 3) - instances = - Kokkos::Experimental::partition_space(execution_space(), 1, 1, 1); - else - instances = - 
Kokkos::Experimental::partition_space(execution_space(), 1, 1, 1, 1); - - std::vector kh_v(nstreams); - std::vector kh_ptr_v(nstreams); - std::vector A_row_map_v(nstreams); - std::vector A_entries_v(nstreams); - std::vector A_values_v(nstreams); - std::vector L_row_map_v(nstreams); - std::vector L_entries_v(nstreams); - std::vector L_values_v(nstreams); - std::vector U_row_map_v(nstreams); - std::vector U_entries_v(nstreams); - std::vector U_values_v(nstreams); - - RowMapType_hostmirror hrow_map("hrow_map", nrows + 1); - EntriesType_hostmirror hentries("hentries", nnz); - ValuesType_hostmirror hvalues("hvalues", nnz); - - scalar_t ZERO = scalar_t(0); - scalar_t ONE = scalar_t(1); - scalar_t MONE = scalar_t(-1); - - hrow_map(0) = 0; - hrow_map(1) = 3; - hrow_map(2) = 5; - hrow_map(3) = 6; - hrow_map(4) = 9; - hrow_map(5) = 11; - hrow_map(6) = 13; - hrow_map(7) = 15; - hrow_map(8) = 18; - hrow_map(9) = nnz; - - hentries(0) = 0; - hentries(1) = 2; - hentries(2) = 5; - hentries(3) = 1; - hentries(4) = 6; - hentries(5) = 2; - hentries(6) = 0; - hentries(7) = 3; - hentries(8) = 4; - hentries(9) = 0; - hentries(10) = 4; - hentries(11) = 1; - hentries(12) = 5; - hentries(13) = 2; - hentries(14) = 6; - hentries(15) = 3; - hentries(16) = 4; - hentries(17) = 7; - hentries(18) = 3; - hentries(19) = 4; - hentries(20) = 8; - - hvalues(0) = 10; - hvalues(1) = 0.3; - hvalues(2) = 0.6; - hvalues(3) = 11; - hvalues(4) = 0.7; - hvalues(5) = 12; - hvalues(6) = 5; - hvalues(7) = 13; - hvalues(8) = 1; - hvalues(9) = 4; - hvalues(10) = 14; - hvalues(11) = 3; - hvalues(12) = 15; - hvalues(13) = 7; - hvalues(14) = 16; - hvalues(15) = 6; - hvalues(16) = 5; - hvalues(17) = 17; - hvalues(18) = 2; - hvalues(19) = 2.5; - hvalues(20) = 18; - - typename KernelHandle::const_nnz_lno_t fill_lev = 2; - - for (int i = 0; i < nstreams; i++) { - // Allocate A as input - A_row_map_v[i] = RowMapType("A_row_map", nrows + 1); - A_entries_v[i] = EntriesType("A_entries", nnz); - A_values_v[i] = 
ValuesType("A_values", nnz); - - // Copy from host to device - Kokkos::deep_copy(A_row_map_v[i], hrow_map); - Kokkos::deep_copy(A_entries_v[i], hentries); - Kokkos::deep_copy(A_values_v[i], hvalues); - - // Create handle - kh_v[i] = KernelHandle(); - if (test_algo == 0) - kh_v[i].create_spiluk_handle(SPILUKAlgorithm::SEQLVLSCHD_RP, nrows, - 4 * nrows, 4 * nrows); - else if (test_algo == 1) - kh_v[i].create_spiluk_handle(SPILUKAlgorithm::SEQLVLSCHD_TP1, nrows, - 4 * nrows, 4 * nrows); - kh_ptr_v[i] = &kh_v[i]; - - auto spiluk_handle = kh_v[i].get_spiluk_handle(); - std::cout << " Stream " << i << ": "; - spiluk_handle->print_algorithm(); + if (!run_streams_test) return; - // Allocate L and U as outputs - L_row_map_v[i] = RowMapType("L_row_map", nrows + 1); - L_entries_v[i] = EntriesType("L_entries", spiluk_handle->get_nnzL()); - L_values_v[i] = ValuesType("L_values", spiluk_handle->get_nnzL()); - U_row_map_v[i] = RowMapType("U_row_map", nrows + 1); - U_entries_v[i] = EntriesType("U_entries", spiluk_handle->get_nnzU()); - U_values_v[i] = ValuesType("U_values", spiluk_handle->get_nnzU()); - - // Symbolic phase - spiluk_symbolic(kh_ptr_v[i], fill_lev, A_row_map_v[i], A_entries_v[i], - L_row_map_v[i], L_entries_v[i], U_row_map_v[i], - U_entries_v[i], nstreams); + std::vector weights(nstreams, 1); + std::vector instances = + Kokkos::Experimental::partition_space(execution_space(), weights); - Kokkos::fence(); + std::vector kh_v(nstreams); + std::vector kh_ptr_v(nstreams); + std::vector A_row_map_v(nstreams); + std::vector A_entries_v(nstreams); + std::vector A_values_v(nstreams); + std::vector L_row_map_v(nstreams); + std::vector L_entries_v(nstreams); + std::vector L_values_v(nstreams); + std::vector U_row_map_v(nstreams); + std::vector U_entries_v(nstreams); + std::vector U_values_v(nstreams); - Kokkos::resize(L_entries_v[i], spiluk_handle->get_nnzL()); - Kokkos::resize(L_values_v[i], spiluk_handle->get_nnzL()); - Kokkos::resize(U_entries_v[i], 
spiluk_handle->get_nnzU()); - Kokkos::resize(U_values_v[i], spiluk_handle->get_nnzU()); - } // Done handle creation and spiluk_symbolic on all streams - - // Numeric phase - spiluk_numeric_streams(instances, kh_ptr_v, fill_lev, A_row_map_v, - A_entries_v, A_values_v, L_row_map_v, L_entries_v, - L_values_v, U_row_map_v, U_entries_v, U_values_v); - - for (int i = 0; i < nstreams; i++) instances[i].fence(); - - // Checking - for (int i = 0; i < nstreams; i++) { - auto spiluk_handle = kh_v[i].get_spiluk_handle(); - crsMat_t A("A_Mtx", nrows, nrows, nnz, A_values_v[i], A_row_map_v[i], - A_entries_v[i]); - crsMat_t L("L_Mtx", nrows, nrows, spiluk_handle->get_nnzL(), L_values_v[i], - L_row_map_v[i], L_entries_v[i]); - crsMat_t U("U_Mtx", nrows, nrows, spiluk_handle->get_nnzU(), U_values_v[i], - U_row_map_v[i], U_entries_v[i]); + std::vector> Afix = get_fixture(); - // Create a reference view e set to all 1's - ValuesType e_one("e_one", nrows); - Kokkos::deep_copy(e_one, 1.0); + RowMapType row_map; + EntriesType entries; + ValuesType values; - // Create two views for spmv results - ValuesType bb("bb", nrows); - ValuesType bb_tmp("bb_tmp", nrows); + compress_matrix(row_map, entries, values, Afix); - // Compute norm2(L*U*e_one - A*e_one)/norm2(A*e_one) - KokkosSparse::spmv("N", ONE, A, e_one, ZERO, bb); + const size_type nrows = Afix.size(); + const size_type nnz = values.extent(0); - typename AT::mag_type bb_nrm = KokkosBlas::nrm2(bb); + RowMapType_hostmirror hrow_map("hrow_map", nrows + 1); + EntriesType_hostmirror hentries("hentries", nnz); + ValuesType_hostmirror hvalues("hvalues", nnz); - KokkosSparse::spmv("N", ONE, U, e_one, ZERO, bb_tmp); - KokkosSparse::spmv("N", ONE, L, bb_tmp, MONE, bb); + Kokkos::deep_copy(hrow_map, row_map); + Kokkos::deep_copy(hentries, entries); + Kokkos::deep_copy(hvalues, values); - typename AT::mag_type diff_nrm = KokkosBlas::nrm2(bb); + typename KernelHandle::const_nnz_lno_t fill_lev = 2; + + for (int i = 0; i < nstreams; i++) { + // 
Allocate A as input + A_row_map_v[i] = RowMapType("A_row_map", nrows + 1); + A_entries_v[i] = EntriesType("A_entries", nnz); + A_values_v[i] = ValuesType("A_values", nnz); + + // Copy from host to device + Kokkos::deep_copy(A_row_map_v[i], hrow_map); + Kokkos::deep_copy(A_entries_v[i], hentries); + Kokkos::deep_copy(A_values_v[i], hvalues); + + // Create handle + kh_v[i] = KernelHandle(); + kh_v[i].create_spiluk_handle(test_algo, nrows, 4 * nrows, 4 * nrows); + kh_ptr_v[i] = &kh_v[i]; + + auto spiluk_handle = kh_v[i].get_spiluk_handle(); + + // Allocate L and U as outputs + L_row_map_v[i] = RowMapType("L_row_map", nrows + 1); + L_entries_v[i] = EntriesType("L_entries", spiluk_handle->get_nnzL()); + U_row_map_v[i] = RowMapType("U_row_map", nrows + 1); + U_entries_v[i] = EntriesType("U_entries", spiluk_handle->get_nnzU()); + + // Symbolic phase + spiluk_symbolic(kh_ptr_v[i], fill_lev, A_row_map_v[i], A_entries_v[i], + L_row_map_v[i], L_entries_v[i], U_row_map_v[i], + U_entries_v[i], nstreams); + + Kokkos::fence(); - EXPECT_TRUE((diff_nrm / bb_nrm) < 1e-4); + Kokkos::resize(L_entries_v[i], spiluk_handle->get_nnzL()); + Kokkos::resize(U_entries_v[i], spiluk_handle->get_nnzU()); + L_values_v[i] = ValuesType("L_values", spiluk_handle->get_nnzL()); + U_values_v[i] = ValuesType("U_values", spiluk_handle->get_nnzU()); + } // Done handle creation and spiluk_symbolic on all streams - kh_v[i].destroy_spiluk_handle(); + // Numeric phase + spiluk_numeric_streams(instances, kh_ptr_v, fill_lev, A_row_map_v, + A_entries_v, A_values_v, L_row_map_v, L_entries_v, + L_values_v, U_row_map_v, U_entries_v, U_values_v); + + for (int i = 0; i < nstreams; i++) instances[i].fence(); + + // Checking + for (int i = 0; i < nstreams; i++) { + check_result(A_row_map_v[i], A_entries_v[i], A_values_v[i], + L_row_map_v[i], L_entries_v[i], L_values_v[i], + U_row_map_v[i], U_entries_v[i], U_values_v[i], + fill_lev); + + kh_v[i].destroy_spiluk_handle(); + } } -} + + static void 
run_test_spiluk_streams_blocks(SPILUKAlgorithm test_algo, + int nstreams) { + // Workaround for OpenMP: skip tests if concurrency < nstreams because of + // not enough resource to partition + bool run_streams_test = true; +#ifdef KOKKOS_ENABLE_OPENMP + if (std::is_same::value) { + int exec_concurrency = execution_space().concurrency(); + if (exec_concurrency < nstreams) { + run_streams_test = false; + std::cout << " Skip stream test: concurrency = " << exec_concurrency + << std::endl; + } + } +#endif + if (!run_streams_test) return; + + std::vector weights(nstreams, 1); + std::vector instances = + Kokkos::Experimental::partition_space(execution_space(), weights); + + std::vector kh_v(nstreams); + std::vector kh_ptr_v(nstreams); + std::vector A_row_map_v(nstreams); + std::vector A_entries_v(nstreams); + std::vector A_values_v(nstreams); + std::vector L_row_map_v(nstreams); + std::vector L_entries_v(nstreams); + std::vector L_values_v(nstreams); + std::vector U_row_map_v(nstreams); + std::vector U_entries_v(nstreams); + std::vector U_values_v(nstreams); + + std::vector> Afix = get_fixture(); + + RowMapType row_map, brow_map; + EntriesType entries, bentries; + ValuesType values, bvalues; + + compress_matrix(row_map, entries, values, Afix); + + const size_type nrows = Afix.size(); + const size_type block_size = nrows % 2 == 0 ? 
2 : 3; + const size_type block_items = block_size * block_size; + ASSERT_EQ(nrows % block_size, 0); + + // Convert to BSR + Crs crs("crs for block spiluk test", nrows, nrows, values.extent(0), values, + row_map, entries); + Bsr bsr(crs, block_size); + + // Pull out views from BSR + Kokkos::resize(brow_map, bsr.graph.row_map.extent(0)); + Kokkos::resize(bentries, bsr.graph.entries.extent(0)); + Kokkos::resize(bvalues, bsr.values.extent(0)); + Kokkos::deep_copy(brow_map, bsr.graph.row_map); + Kokkos::deep_copy(bentries, bsr.graph.entries); + Kokkos::deep_copy(bvalues, bsr.values); + + const size_type bnrows = brow_map.extent(0) - 1; + const size_type bnnz = bentries.extent(0); + + RowMapType_hostmirror hrow_map("hrow_map", bnrows + 1); + EntriesType_hostmirror hentries("hentries", bnnz); + ValuesType_hostmirror hvalues("hvalues", bnnz * block_items); + + Kokkos::deep_copy(hrow_map, brow_map); + Kokkos::deep_copy(hentries, bentries); + Kokkos::deep_copy(hvalues, bvalues); + + typename KernelHandle::const_nnz_lno_t fill_lev = 2; + + for (int i = 0; i < nstreams; i++) { + // Allocate A as input + A_row_map_v[i] = RowMapType("A_row_map", bnrows + 1); + A_entries_v[i] = EntriesType("A_entries", bnnz); + A_values_v[i] = ValuesType("A_values", bnnz * block_items); + + // Copy from host to device + Kokkos::deep_copy(A_row_map_v[i], hrow_map); + Kokkos::deep_copy(A_entries_v[i], hentries); + Kokkos::deep_copy(A_values_v[i], hvalues); + + // Create handle + kh_v[i] = KernelHandle(); + kh_v[i].create_spiluk_handle(test_algo, bnrows, 4 * bnrows, 4 * bnrows, + block_size); + kh_ptr_v[i] = &kh_v[i]; + + auto spiluk_handle = kh_v[i].get_spiluk_handle(); + + // Allocate L and U as outputs + L_row_map_v[i] = RowMapType("L_row_map", bnrows + 1); + L_entries_v[i] = EntriesType("L_entries", spiluk_handle->get_nnzL()); + U_row_map_v[i] = RowMapType("U_row_map", bnrows + 1); + U_entries_v[i] = EntriesType("U_entries", spiluk_handle->get_nnzU()); + + // Symbolic phase + 
spiluk_symbolic(kh_ptr_v[i], fill_lev, A_row_map_v[i], A_entries_v[i], + L_row_map_v[i], L_entries_v[i], U_row_map_v[i], + U_entries_v[i], nstreams); + + Kokkos::fence(); + + Kokkos::resize(L_entries_v[i], spiluk_handle->get_nnzL()); + Kokkos::resize(U_entries_v[i], spiluk_handle->get_nnzU()); + L_values_v[i] = + ValuesType("L_values", spiluk_handle->get_nnzL() * block_items); + U_values_v[i] = + ValuesType("U_values", spiluk_handle->get_nnzU() * block_items); + } // Done handle creation and spiluk_symbolic on all streams + + // Numeric phase + spiluk_numeric_streams(instances, kh_ptr_v, fill_lev, A_row_map_v, + A_entries_v, A_values_v, L_row_map_v, L_entries_v, + L_values_v, U_row_map_v, U_entries_v, U_values_v); + + for (int i = 0; i < nstreams; i++) instances[i].fence(); + + // Checking + for (int i = 0; i < nstreams; i++) { + check_result(A_row_map_v[i], A_entries_v[i], A_values_v[i], + L_row_map_v[i], L_entries_v[i], L_values_v[i], + U_row_map_v[i], U_entries_v[i], U_values_v[i], + fill_lev, block_size); + + kh_v[i].destroy_spiluk_handle(); + } + } + + template + static void run_test_spiluk_precond() { + // Test using spiluk as a preconditioner + // Does (LU)^inv Ax = (LU)^inv b converge faster than solving Ax=b? + + // Create a diagonally dominant sparse matrix to test: + using sp_matrix_type = std::conditional_t; + + constexpr auto nrows = 5000; + constexpr auto m = 15; + constexpr auto diagDominance = 2; + constexpr auto tol = 1e-5; + constexpr bool verbose = false; + + if (UseBlocks) { + // Skip test if not on host. 
block trsv only works on host + static constexpr bool is_host = + std::is_same::value; + if (!is_host) { + return; + } + } + + RowMapType brow_map; + EntriesType bentries; + ValuesType bvalues; + + size_type nnz = 10 * nrows; + auto A_unblocked = + KokkosSparse::Impl::kk_generate_diagonally_dominant_sparse_matrix( + nrows, nrows, nnz, 0, lno_t(0.01 * nrows), diagDominance); + + KokkosSparse::sort_crs_matrix(A_unblocked); + + std::vector block_sizes_blocked = {1, 2, 4, 10}; + std::vector block_sizes_unblocked = {1}; + std::vector block_sizes = + UseBlocks ? block_sizes_blocked : block_sizes_unblocked; + + for (auto block_size : block_sizes) { + // Convert to BSR if block enabled + auto A = get_A(A_unblocked, block_size); + + // Pull out views from BSR + Kokkos::resize(brow_map, A.graph.row_map.extent(0)); + Kokkos::resize(bentries, A.graph.entries.extent(0)); + Kokkos::resize(bvalues, A.values.extent(0)); + Kokkos::deep_copy(brow_map, A.graph.row_map); + Kokkos::deep_copy(bentries, A.graph.entries); + Kokkos::deep_copy(bvalues, A.values); + + // Make kernel handles + KernelHandle kh; + kh.create_gmres_handle(m, tol); + auto gmres_handle = kh.get_gmres_handle(); + gmres_handle->set_verbose(verbose); + using GMRESHandle = + typename std::remove_reference::type; + + for (lno_t fill_lev = 0; fill_lev < 4; ++fill_lev) { + const auto [L_row_map, L_entries, L_values, U_row_map, U_entries, + U_values] = + run_and_check_spiluk(kh, brow_map, bentries, bvalues, + SPILUKAlgorithm::SEQLVLSCHD_TP1, + fill_lev, block_size); + + // Create L, U + auto L = make_matrix("L_Mtx", L_row_map, L_entries, + L_values, block_size); + auto U = make_matrix("U_Mtx", U_row_map, U_entries, + U_values, block_size); + + // Set initial vectors: + ValuesType X("X", nrows); // Solution and initial guess + ValuesType Wj("Wj", nrows); // For checking residuals at end. 
+ ValuesType B(Kokkos::view_alloc(Kokkos::WithoutInitializing, "B"), + nrows); // right-hand side vec + // Make rhs ones so that results are repeatable: + Kokkos::deep_copy(B, 1.0); + + int num_iters_plain(0), num_iters_precond(0); + + // Solve Ax = b + { + gmres(&kh, A, B, X); + + // Double check residuals at end of solve: + float_t nrmB = KokkosBlas::nrm2(B); + KokkosSparse::spmv("N", 1.0, A, X, 0.0, Wj); // wj = Ax + KokkosBlas::axpy(-1.0, Wj, B); // b = b-Ax. + float_t endRes = KokkosBlas::nrm2(B) / nrmB; + + const auto conv_flag = gmres_handle->get_conv_flag_val(); + num_iters_plain = gmres_handle->get_num_iters(); + + EXPECT_GT(num_iters_plain, 0); + EXPECT_LT(endRes, gmres_handle->get_tol()); + EXPECT_EQ(conv_flag, GMRESHandle::Flag::Conv); + + if (TEST_SPILUK_VERBOSE_LEVEL > 0) { + std::cout << "Without LUPrec, with block_size=" << block_size + << ", converged in " << num_iters_plain + << " steps with endres=" << endRes << std::endl; + } + } + + // Solve Ax = b with LU preconditioner. + { + gmres_handle->reset_handle(m, tol); + gmres_handle->set_verbose(verbose); + + // Make precond. + KokkosSparse::Experimental::LUPrec + myPrec(L, U); + + // reset X for next gmres call + Kokkos::deep_copy(X, 0.0); + + gmres(&kh, A, B, X, &myPrec); + + // Double check residuals at end of solve: + float_t nrmB = KokkosBlas::nrm2(B); + KokkosSparse::spmv("N", 1.0, A, X, 0.0, Wj); // wj = Ax + KokkosBlas::axpy(-1.0, Wj, B); // b = b-Ax. 
+ float_t endRes = KokkosBlas::nrm2(B) / nrmB; + + const auto conv_flag = gmres_handle->get_conv_flag_val(); + num_iters_precond = gmres_handle->get_num_iters(); + + EXPECT_LT(endRes, gmres_handle->get_tol()); + EXPECT_EQ(conv_flag, GMRESHandle::Flag::Conv); + EXPECT_LT(num_iters_precond, num_iters_plain); + + if (TEST_SPILUK_VERBOSE_LEVEL > 0) { + std::cout << "With LUPrec, with block_size=" << block_size + << ", and fill_level=" << fill_lev << ", converged in " + << num_iters_precond << " steps with endres=" << endRes + << std::endl; + } + } + } + } + } +}; } // namespace Test template void test_spiluk() { - Test::run_test_spiluk(); + using TestStruct = Test::SpilukTest; + TestStruct::run_test_spiluk(); + TestStruct::run_test_spiluk_blocks(); + TestStruct::run_test_spiluk_scale(); + TestStruct::run_test_spiluk_scale_blocks(); + TestStruct::template run_test_spiluk_precond(); + TestStruct::template run_test_spiluk_precond(); } template void test_spiluk_streams() { - std::cout << "SPILUKAlgorithm::SEQLVLSCHD_RP: 1 stream" << std::endl; - Test::run_test_spiluk_streams(0, 1); - - std::cout << "SPILUKAlgorithm::SEQLVLSCHD_RP: 2 streams" << std::endl; - Test::run_test_spiluk_streams(0, 2); - - std::cout << "SPILUKAlgorithm::SEQLVLSCHD_RP: 3 streams" << std::endl; - Test::run_test_spiluk_streams(0, 3); - - std::cout << "SPILUKAlgorithm::SEQLVLSCHD_RP: 4 streams" << std::endl; - Test::run_test_spiluk_streams(0, 4); - - std::cout << "SPILUKAlgorithm::SEQLVLSCHD_TP1: 1 stream" << std::endl; - Test::run_test_spiluk_streams(1, 1); - - std::cout << "SPILUKAlgorithm::SEQLVLSCHD_TP1: 2 streams" << std::endl; - Test::run_test_spiluk_streams(1, 2); - - std::cout << "SPILUKAlgorithm::SEQLVLSCHD_TP1: 3 streams" << std::endl; - Test::run_test_spiluk_streams(1, 3); - - std::cout << "SPILUKAlgorithm::SEQLVLSCHD_TP1: 4 streams" << std::endl; - Test::run_test_spiluk_streams(1, 4); + using TestStruct = Test::SpilukTest; + + 
TestStruct::run_test_spiluk_streams(SPILUKAlgorithm::SEQLVLSCHD_TP1, 1); + TestStruct::run_test_spiluk_streams(SPILUKAlgorithm::SEQLVLSCHD_TP1, 2); + TestStruct::run_test_spiluk_streams(SPILUKAlgorithm::SEQLVLSCHD_TP1, 3); + TestStruct::run_test_spiluk_streams(SPILUKAlgorithm::SEQLVLSCHD_TP1, 4); + + TestStruct::run_test_spiluk_streams_blocks(SPILUKAlgorithm::SEQLVLSCHD_TP1, + 1); + TestStruct::run_test_spiluk_streams_blocks(SPILUKAlgorithm::SEQLVLSCHD_TP1, + 2); + TestStruct::run_test_spiluk_streams_blocks(SPILUKAlgorithm::SEQLVLSCHD_TP1, + 3); + TestStruct::run_test_spiluk_streams_blocks(SPILUKAlgorithm::SEQLVLSCHD_TP1, + 4); } #define KOKKOSKERNELS_EXECUTE_TEST(SCALAR, ORDINAL, OFFSET, DEVICE) \ diff --git a/sparse/unit_test/Test_Sparse_spmv.hpp b/sparse/unit_test/Test_Sparse_spmv.hpp index 990fcc1a30..c5107fcf0a 100644 --- a/sparse/unit_test/Test_Sparse_spmv.hpp +++ b/sparse/unit_test/Test_Sparse_spmv.hpp @@ -24,7 +24,6 @@ #include #include -#include "KokkosKernels_Controls.hpp" #include "KokkosKernels_default_types.hpp" // #ifndef kokkos_complex_double @@ -180,10 +179,10 @@ void sequential_spmv(crsMat_t input_mat, x_vector_type x, y_vector_type y, Kokkos::fence(); } -template +template void check_spmv( - const KokkosKernels::Experimental::Controls &controls, crsMat_t input_mat, - x_vector_type x, y_vector_type y, + handle_t *handle, crsMat_t input_mat, x_vector_type x, y_vector_type y, typename y_vector_type::non_const_value_type alpha, typename y_vector_type::non_const_value_type beta, const std::string &mode, typename Kokkos::ArithTraits::mag_type @@ -208,7 +207,7 @@ void check_spmv( bool threw = false; std::string msg; try { - KokkosSparse::spmv(controls, mode.data(), alpha, input_mat, x, beta, y); + KokkosSparse::spmv(handle, mode.data(), alpha, input_mat, x, beta, y); Kokkos::fence(); } catch (std::exception &e) { threw = true; @@ -229,9 +228,10 @@ void check_spmv( EXPECT_TRUE(num_errors == 0); } -template +template void check_spmv_mv( - crsMat_t 
input_mat, x_vector_type x, y_vector_type y, + Handle *handle, crsMat_t input_mat, x_vector_type x, y_vector_type y, y_vector_type expected_y, typename y_vector_type::non_const_value_type alpha, typename y_vector_type::non_const_value_type beta, int numMV, @@ -259,7 +259,7 @@ void check_spmv_mv( bool threw = false; std::string msg; try { - KokkosSparse::spmv(mode.data(), alpha, input_mat, x, beta, y); + KokkosSparse::spmv(handle, mode.data(), alpha, input_mat, x, beta, y); Kokkos::fence(); } catch (std::exception &e) { threw = true; @@ -388,51 +388,6 @@ void check_spmv_mv_struct( } } // check_spmv_mv_struct -template -void check_spmv_controls( - KokkosKernels::Experimental::Controls controls, crsMat_t input_mat, - x_vector_type x, y_vector_type y, - typename y_vector_type::non_const_value_type alpha, - typename y_vector_type::non_const_value_type beta, - typename Kokkos::ArithTraits::mag_type - max_val) { - // typedef typename crsMat_t::StaticCrsGraphType graph_t; - using ExecSpace = typename crsMat_t::execution_space; - using my_exec_space = Kokkos::RangePolicy; - using y_value_type = typename y_vector_type::non_const_value_type; - using y_value_trait = Kokkos::ArithTraits; - using y_value_mag_type = typename y_value_trait::mag_type; - - // y is the quantity being tested here, - // so let us use y_value_type to determine - // the appropriate tolerance precision. - const y_value_mag_type eps = - std::is_same::value ? 
2 * 1e-3 : 1e-7; - const size_t nr = input_mat.numRows(); - y_vector_type expected_y("expected", nr); - Kokkos::deep_copy(expected_y, y); - Kokkos::fence(); - - sequential_spmv(input_mat, x, expected_y, alpha, beta); - -#ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE - controls.setParameter("algorithm", "merge"); - printf("requested merge based algorithm\n"); -#endif - - KokkosSparse::spmv(controls, "N", alpha, input_mat, x, beta, y); - int num_errors = 0; - Kokkos::parallel_reduce( - "KokkosSparse::Test::spmv", my_exec_space(0, y.extent(0)), - fSPMV(expected_y, y, eps, max_val), - num_errors); - if (num_errors > 0) - printf("KokkosSparse::Test::spmv: %i errors of %i with params: %lf %lf\n", - num_errors, y.extent_int(0), y_value_trait::abs(alpha), - y_value_trait::abs(beta)); - EXPECT_TRUE(num_errors == 0); -} // check_spmv_controls - } // namespace Test template @@ -452,15 +407,16 @@ Kokkos::complex randomUpperBound>(int mag) { template -void test_spmv(const KokkosKernels::Experimental::Controls &controls, - lno_t numRows, size_type nnz, lno_t bandwidth, - lno_t row_size_variance, bool heavy) { +void test_spmv(KokkosSparse::SPMVAlgorithm algo, lno_t numRows, size_type nnz, + lno_t bandwidth, lno_t row_size_variance, bool heavy) { using crsMat_t = typename KokkosSparse::CrsMatrix; using scalar_view_t = typename crsMat_t::values_type::non_const_type; using x_vector_type = scalar_view_t; using y_vector_type = scalar_view_t; using mag_t = typename Kokkos::ArithTraits::mag_type; + using handle_t = + KokkosSparse::SPMVHandle; constexpr mag_t max_x = static_cast(1); constexpr mag_t max_y = static_cast(1); @@ -504,12 +460,17 @@ void test_spmv(const KokkosKernels::Experimental::Controls &controls, testAlphaBeta.push_back(-1.0); testAlphaBeta.push_back(2.5); } + + // This handle can be reused for all following calls, since the matrix does + // not change + handle_t handle(algo); + for (auto mode : nonTransModes) { for (double alpha : testAlphaBeta) { for (double beta : 
testAlphaBeta) { mag_t max_error = beta * max_y + alpha * max_nnz_per_row * max_val * max_x; - Test::check_spmv(controls, input_mat, input_x, output_y, alpha, beta, + Test::check_spmv(&handle, input_mat, input_x, output_y, alpha, beta, mode, max_error); } } @@ -520,7 +481,7 @@ void test_spmv(const KokkosKernels::Experimental::Controls &controls, // hoping the transpose won't have a long column... mag_t max_error = beta * max_y + alpha * max_nnz_per_row * max_val * max_x; - Test::check_spmv(controls, input_mat, input_xt, output_yt, alpha, beta, + Test::check_spmv(&handle, input_mat, input_xt, output_yt, alpha, beta, mode, max_error); } } @@ -531,29 +492,10 @@ template void test_spmv_algorithms(lno_t numRows, size_type nnz, lno_t bandwidth, lno_t row_size_variance, bool heavy) { - { - KokkosKernels::Experimental::Controls controls; - test_spmv( - controls, numRows, nnz, bandwidth, row_size_variance, heavy); - } - - { - KokkosKernels::Experimental::Controls controls; - controls.setParameter("algorithm", "native"); - test_spmv( - controls, numRows, nnz, bandwidth, row_size_variance, heavy); - } - { - KokkosKernels::Experimental::Controls controls; - controls.setParameter("algorithm", "merge"); - test_spmv( - controls, numRows, nnz, bandwidth, row_size_variance, heavy); - } - { - KokkosKernels::Experimental::Controls controls; - controls.setParameter("algorithm", "native-merge"); - test_spmv( - controls, numRows, nnz, bandwidth, row_size_variance, heavy); + using namespace KokkosSparse; + for (SPMVAlgorithm algo : {SPMV_DEFAULT, SPMV_NATIVE, SPMV_MERGE_PATH}) { + test_spmv(algo, numRows, nnz, bandwidth, + row_size_variance, heavy); } } @@ -573,14 +515,16 @@ void test_spmv_mv(lno_t numRows, size_type nnz, lno_t bandwidth, void, size_type>; using ViewTypeX = Kokkos::View; using ViewTypeY = Kokkos::View; + using handle_t = + KokkosSparse::SPMVHandle; - ViewTypeX b_x("A", numRows, numMV); - ViewTypeY b_y("B", numCols, numMV); - ViewTypeY b_y_copy("B", numCols, numMV); + 
ViewTypeX b_x("A", numCols, numMV); + ViewTypeY b_y("B", numRows, numMV); + ViewTypeY b_y_copy("B", numRows, numMV); - ViewTypeX b_xt("A", numCols, numMV); - ViewTypeY b_yt("B", numRows, numMV); - ViewTypeY b_yt_copy("B", numRows, numMV); + ViewTypeX b_xt("A", numRows, numMV); + ViewTypeY b_yt("B", numCols, numMV); + ViewTypeY b_yt_copy("B", numCols, numMV); Kokkos::Random_XorShift64_Pool rand_pool( 13718); @@ -613,13 +557,14 @@ void test_spmv_mv(lno_t numRows, size_type nnz, lno_t bandwidth, testAlphaBeta.push_back(-1.0); testAlphaBeta.push_back(2.5); } + handle_t handle; for (auto mode : nonTransModes) { for (double alpha : testAlphaBeta) { for (double beta : testAlphaBeta) { mag_t max_error = beta * max_y + alpha * max_nnz_per_row * max_val * max_x; - Test::check_spmv_mv(input_mat, b_x, b_y, b_y_copy, alpha, beta, numMV, - mode, max_error); + Test::check_spmv_mv(&handle, input_mat, b_x, b_y, b_y_copy, alpha, beta, + numMV, mode, max_error); } } } @@ -629,17 +574,17 @@ void test_spmv_mv(lno_t numRows, size_type nnz, lno_t bandwidth, // hoping the transpose won't have a long column... 
mag_t max_error = beta * max_y + alpha * max_nnz_per_row * max_val * max_x; - Test::check_spmv_mv(input_mat, b_xt, b_yt, b_yt_copy, alpha, beta, - numMV, mode, max_error); + Test::check_spmv_mv(&handle, input_mat, b_xt, b_yt, b_yt_copy, alpha, + beta, numMV, mode, max_error); } } } } template -void test_spmv_mv_heavy(lno_t numRows, size_type nnz, lno_t bandwidth, - lno_t row_size_variance, int numMV) { + typename layout_x, typename layout_y, class Device> +void test_spmv_mv_heavy(lno_t numRows, lno_t numCols, size_type nnz, + lno_t bandwidth, lno_t row_size_variance, int numMV) { #if defined(KOKKOSKERNELS_ENABLE_TPL_ARMPL) || defined(KOKKOS_ARCH_A64FX) if (std::is_same>::value) { std::cerr @@ -651,16 +596,18 @@ void test_spmv_mv_heavy(lno_t numRows, size_type nnz, lno_t bandwidth, #endif // KOKKOSKERNELS_ENABLE_TPL_ARMPL || KOKKOS_ARCH_A64FX using crsMat_t = typename KokkosSparse::CrsMatrix; - using ViewTypeX = Kokkos::View; - using ViewTypeY = Kokkos::View; + using ViewTypeX = Kokkos::View; + using ViewTypeY = Kokkos::View; using mag_t = typename Kokkos::ArithTraits::mag_type; + using handle_t = + KokkosSparse::SPMVHandle; constexpr mag_t max_x = static_cast(10); constexpr mag_t max_y = static_cast(10); constexpr mag_t max_val = static_cast(10); crsMat_t input_mat = KokkosSparse::Impl::kk_generate_sparse_matrix( - numRows, numRows, nnz, row_size_variance, bandwidth); + numRows, numCols, nnz, row_size_variance, bandwidth); Kokkos::Random_XorShift64_Pool rand_pool( 13718); @@ -668,26 +615,35 @@ void test_spmv_mv_heavy(lno_t numRows, size_type nnz, lno_t bandwidth, numRows ? 
(nnz / numRows + row_size_variance) : 0; for (int nv = 1; nv <= numMV; nv++) { - ViewTypeX b_x("A", numRows, nv); + ViewTypeX b_x("A", numCols, nv); ViewTypeY b_y("B", numRows, nv); ViewTypeY b_y_copy("B", numRows, nv); + ViewTypeX b_xt("A", numRows, nv); + ViewTypeY b_yt("B", numCols, nv); + ViewTypeY b_yt_copy("B", numCols, nv); + Kokkos::fill_random(b_x, rand_pool, scalar_t(10)); Kokkos::fill_random(b_y, rand_pool, scalar_t(10)); + Kokkos::fill_random(b_xt, rand_pool, scalar_t(10)); + Kokkos::fill_random(b_yt, rand_pool, scalar_t(10)); Kokkos::fill_random(input_mat.values, rand_pool, scalar_t(10)); Kokkos::deep_copy(b_y_copy, b_y); - - Test::check_spmv_mv(input_mat, b_x, b_y, b_y_copy, 1.0, 0.0, nv, "N", - max_nnz_per_row * max_val * max_x); - Test::check_spmv_mv(input_mat, b_x, b_y, b_y_copy, 0.0, 1.0, nv, "N", - max_y); - Test::check_spmv_mv(input_mat, b_x, b_y, b_y_copy, 1.0, 1.0, nv, "N", - max_y + max_nnz_per_row * max_val * max_x); - Test::check_spmv_mv(input_mat, b_x, b_y, b_y_copy, 1.0, 0.0, nv, "T", - max_nnz_per_row * max_val * max_x); - Test::check_spmv_mv(input_mat, b_x, b_y, b_y_copy, 0.0, 1.0, nv, "T", - max_y); + Kokkos::deep_copy(b_yt_copy, b_yt); + + handle_t handle; + + Test::check_spmv_mv(&handle, input_mat, b_x, b_y, b_y_copy, 1.0, 0.0, nv, + "N", max_nnz_per_row * max_val * max_x); + Test::check_spmv_mv(&handle, input_mat, b_x, b_y, b_y_copy, 0.0, 1.0, nv, + "N", max_y); + Test::check_spmv_mv(&handle, input_mat, b_x, b_y, b_y_copy, 1.0, 1.0, nv, + "N", max_y + max_nnz_per_row * max_val * max_x); + Test::check_spmv_mv(&handle, input_mat, b_xt, b_yt, b_yt_copy, 1.0, 0.0, nv, + "T", max_nnz_per_row * max_val * max_x); + Test::check_spmv_mv(&handle, input_mat, b_xt, b_yt, b_yt_copy, 0.0, 1.0, nv, + "T", max_y); // Testing all modes together, since matrix is square std::vector modes = {"N", "C", "T", "H"}; std::vector testAlphaBeta = {0.0, 1.0, -1.0, 2.5}; @@ -696,8 +652,13 @@ void test_spmv_mv_heavy(lno_t numRows, size_type nnz, lno_t bandwidth, 
for (double beta : testAlphaBeta) { mag_t max_error = beta * max_y + alpha * max_nnz_per_row * max_val * max_x; - Test::check_spmv_mv(input_mat, b_x, b_y, b_y_copy, alpha, beta, nv, - mode, max_error); + if (*mode == 'N' || *mode == 'C') { + Test::check_spmv_mv(&handle, input_mat, b_x, b_y, b_y_copy, alpha, + beta, nv, mode, max_error); + } else { + Test::check_spmv_mv(&handle, input_mat, b_xt, b_yt, b_yt_copy, + alpha, beta, nv, mode, max_error); + } } } } @@ -956,59 +917,6 @@ void test_spmv_mv_struct_1D(lno_t nx, int numMV) { output_y_copy, 1.0, 1.0, numMV, max_error); } -// check that the controls are flowing down correctly in the spmv kernel -template -void test_spmv_controls(lno_t numRows, size_type nnz, lno_t bandwidth, - lno_t row_size_variance, - const KokkosKernels::Experimental::Controls &controls = - KokkosKernels::Experimental::Controls()) { - using crsMat_t = typename KokkosSparse::CrsMatrix; - using scalar_view_t = typename crsMat_t::values_type::non_const_type; - using x_vector_type = scalar_view_t; - using y_vector_type = scalar_view_t; - using mag_t = typename Kokkos::ArithTraits::mag_type; - - constexpr mag_t max_x = static_cast(10); - constexpr mag_t max_y = static_cast(10); - constexpr mag_t max_val = static_cast(10); - - lno_t numCols = numRows; - - crsMat_t input_mat = KokkosSparse::Impl::kk_generate_sparse_matrix( - numRows, numCols, nnz, row_size_variance, bandwidth); - lno_t nr = input_mat.numRows(); - lno_t nc = input_mat.numCols(); - - x_vector_type input_x("x", nc); - y_vector_type output_y("y", nr); - - Kokkos::Random_XorShift64_Pool rand_pool( - 13718); - - Kokkos::fill_random(input_x, rand_pool, max_x); - Kokkos::fill_random(output_y, rand_pool, max_y); - Kokkos::fill_random(input_mat.values, rand_pool, max_val); - - const mag_t max_error = max_y + bandwidth * max_val * max_x; - - Test::check_spmv_controls(controls, input_mat, input_x, output_y, 1.0, 0.0, - max_error); - Test::check_spmv_controls(controls, input_mat, input_x, 
output_y, 0.0, 1.0, - max_error); - Test::check_spmv_controls(controls, input_mat, input_x, output_y, 1.0, 1.0, - max_error); -} // test_spmv_controls - -// test the native algorithm -template -void test_spmv_native(lno_t numRows, size_type nnz, lno_t bandwidth, - lno_t row_size_variance) { - KokkosKernels::Experimental::Controls controls; - controls.setParameter("algorithm", "native"); - test_spmv_controls(numRows, nnz, bandwidth, row_size_variance, controls); -} // test_spmv_native - // call it if ordinal int and, scalar float and double are instantiated. template void test_github_issue_101() { @@ -1177,6 +1085,10 @@ void test_spmv_all_interfaces_light() { using vector_t = Kokkos::View; using range1D_t = Kokkos::RangePolicy; using range2D_t = Kokkos::MDRangePolicy>; + using v_handle_t = + KokkosSparse::SPMVHandle; + using mv_handle_t = KokkosSparse::SPMVHandle; multivector_t x_mv("x_mv", n, 3); vector_t x("x", n); // Randomize x (it won't be modified after that) @@ -1216,41 +1128,24 @@ void test_spmv_all_interfaces_light() { space_partitions = Kokkos::Experimental::partition_space(space, 1, 1); space = space_partitions[1]; } - KokkosKernels::Experimental::Controls controls; - // All tagged versions - KokkosSparse::spmv(space, controls, "N", 1.0, A, x, 0.0, y, - KokkosSparse::RANK_ONE()); - space.fence(); - verify(); - clear_y(); - KokkosSparse::spmv(controls, "N", 1.0, A, x, 0.0, y, - KokkosSparse::RANK_ONE()); - verify(); - clear_y(); - KokkosSparse::spmv(space, controls, "N", 1.0, A, x_mv, 0.0, y_mv, - KokkosSparse::RANK_TWO()); - space.fence(); - verify_mv(); - clear_y(); - KokkosSparse::spmv(controls, "N", 1.0, A, x_mv, 0.0, y_mv, - KokkosSparse::RANK_TWO()); - verify_mv(); - clear_y(); - // Non-tagged versions - // space and controls - spmv(space, controls, "N", 1.0, A, x, 0.0, y); + + v_handle_t v_handle; + mv_handle_t mv_handle; + + // space and handle + spmv(space, &v_handle, "N", 1.0, A, x, 0.0, y); space.fence(); verify(); clear_y(); - spmv(space, 
controls, "N", 1.0, A, x_mv, 0.0, y_mv); + spmv(space, &mv_handle, "N", 1.0, A, x_mv, 0.0, y_mv); space.fence(); verify_mv(); clear_y(); - // controls - spmv(controls, "N", 1.0, A, x, 0.0, y); + // handle + spmv(&v_handle, "N", 1.0, A, x, 0.0, y); verify(); clear_y(); - spmv(controls, "N", 1.0, A, x_mv, 0.0, y_mv); + spmv(&mv_handle, "N", 1.0, A, x_mv, 0.0, y_mv); verify_mv(); clear_y(); // space @@ -1291,8 +1186,6 @@ void test_spmv_all_interfaces_light() { 100, 10, false); \ test_spmv_algorithms(10000, 10000 * 2, \ 100, 5, false); \ - test_spmv_controls(10000, 10000 * 20, \ - 100, 5); \ } #define EXECUTE_TEST_INTERFACES(SCALAR, ORDINAL, OFFSET, LAYOUT, DEVICE) \ @@ -1308,19 +1201,30 @@ void test_spmv_all_interfaces_light() { TestCategory, \ sparse##_##spmv_mv##_##SCALAR##_##ORDINAL##_##OFFSET##_##LAYOUT##_##DEVICE) { \ test_spmv_mv( \ - 1000, 1000 * 3, 200, 10, true, 1); \ + 1001, 1001 * 3, 200, 10, true, 1); \ test_spmv_mv( \ - 1000, 1000 * 3, 100, 10, true, 5); \ + 999, 999 * 3, 100, 10, true, 5); \ test_spmv_mv( \ - 1000, 1000 * 2, 100, 5, true, 10); \ + 1003, 1003 * 2, 100, 5, true, 10); \ test_spmv_mv( \ - 50000, 50000 * 3, 20, 10, false, 1); \ + 50007, 50007 * 3, 20, 10, false, 1); \ test_spmv_mv( \ - 50000, 50000 * 3, 100, 10, false, 1); \ + 50002, 50002 * 3, 100, 10, false, 1); \ test_spmv_mv( \ 10000, 10000 * 2, 100, 5, false, 5); \ - test_spmv_mv_heavy( \ - 200, 200 * 10, 60, 4, 30); \ + test_spmv_mv_heavy(204, 201, 204 * 10, 60, 4, 30); \ + test_spmv_mv_heavy(2, 3, 5, 3, 1, 10); \ + } + +#define EXECUTE_TEST_MV_MIXED_LAYOUT(SCALAR, ORDINAL, OFFSET, DEVICE) \ + TEST_F( \ + TestCategory, \ + sparse##_##spmv_mv_mixed_layout##_##SCALAR##_##ORDINAL##_##OFFSET##_##LAYOUT##_##DEVICE) { \ + test_spmv_mv_heavy(99, 101, 100 * 15, 40, 4, \ + 20); \ } #define EXECUTE_TEST_STRUCT(SCALAR, ORDINAL, OFFSET, DEVICE) \ @@ -1387,8 +1291,20 @@ EXECUTE_TEST_ISSUE_101(TestDevice) #include #undef KOKKOSKERNELS_EXECUTE_TEST +#endif + +// Test that requires mixing LayoutLeft 
and LayoutRight (never an ETI'd +// combination) +#if (!defined(KOKKOSKERNELS_ETI_ONLY) && \ + !defined(KOKKOSKERNELS_IMPL_CHECK_ETI_CALLS)) + +#define KOKKOSKERNELS_EXECUTE_TEST(SCALAR, ORDINAL, OFFSET, DEVICE) \ + EXECUTE_TEST_MV_MIXED_LAYOUT(SCALAR, ORDINAL, OFFSET, TestDevice) -#endif // defined(KOKKOSKERNELS_INST_LAYOUTRIGHT) +#include + +#undef KOKKOSKERNELS_EXECUTE_TEST +#endif #undef EXECUTE_TEST_FN #undef EXECUTE_TEST_STRUCT diff --git a/sparse/unit_test/Test_Sparse_spmv_bsr.hpp b/sparse/unit_test/Test_Sparse_spmv_bsr.hpp index 5b823a22f7..6482d33d8a 100644 --- a/sparse/unit_test/Test_Sparse_spmv_bsr.hpp +++ b/sparse/unit_test/Test_Sparse_spmv_bsr.hpp @@ -40,7 +40,6 @@ #include #include #include -#include "KokkosKernels_Controls.hpp" #include "KokkosKernels_default_types.hpp" #include "KokkosSparse_spmv.hpp" @@ -53,29 +52,6 @@ using kokkos_complex_double = Kokkos::complex; using kokkos_complex_float = Kokkos::complex; -/* Poor-man's std::optional since CUDA 11.0 seems to have an ICE - https://github.com/kokkos/kokkos-kernels/issues/1943 -*/ -struct OptCtrls { - bool present_; - KokkosKernels::Experimental::Controls ctrls_; - - OptCtrls() : present_(false) {} - OptCtrls(const KokkosKernels::Experimental::Controls &ctrls) - : present_(true), ctrls_(ctrls) {} - - operator bool() const { return present_; } - - constexpr const KokkosKernels::Experimental::Controls &operator*() - const &noexcept { - return ctrls_; - } - constexpr const KokkosKernels::Experimental::Controls *operator->() const - noexcept { - return &ctrls_; - } -}; - namespace Test_Spmv_Bsr { /*! \brief Maximum value used to fill A */ @@ -171,10 +147,10 @@ Bsr bsr_random(const int blockSize, const int blockRows, const int blockCols) { /*! 
\brief test a specific spmv */ -template -void test_spmv(const OptCtrls &controls, const char *mode, const Alpha &alpha, +template +void test_spmv(Handle *handle, const char *mode, const Alpha &alpha, const Beta &beta, const Bsr &a, const Crs &acrs, size_t maxNnzPerRow, const XVector &x, const YVector &y) { using scalar_type = typename Bsr::non_const_value_type; @@ -191,11 +167,7 @@ void test_spmv(const OptCtrls &controls, const char *mode, const Alpha &alpha, YVector yAct("yAct", y.extent(0)); Kokkos::deep_copy(yAct, y); - if (controls) { - KokkosSparse::spmv(*controls, mode, alpha, a, x, beta, yAct); - } else { - KokkosSparse::spmv(mode, alpha, a, x, beta, yAct); - } + KokkosSparse::spmv(handle, mode, alpha, a, x, beta, yAct); // compare yExp and yAct auto hyExp = Kokkos::create_mirror_view(yExp); @@ -223,12 +195,8 @@ void test_spmv(const OptCtrls &controls, const char *mode, const Alpha &alpha, } if (!errIdx.empty()) { - std::string alg; - if (controls) { - alg = controls->getParameter("algorithm", ""); - } else { - alg = ""; - } + std::string alg = + KokkosSparse::get_spmv_algorithm_name(handle->get_algorithm()); std::cerr << __FILE__ << ":" << __LINE__ << " BsrMatrix SpMV failure!" 
<< std::endl; @@ -384,38 +352,43 @@ auto random_vecs_for_spmv(const char *mode, const Bsr &a) { template void test_spmv_combos(const char *mode, const Bsr &a, const Crs &acrs, size_t maxNnzPerRow) { + using namespace KokkosSparse; using scalar_type = typename Bsr::non_const_value_type; using execution_space = typename Bsr::execution_space; auto [x, y] = random_vecs_for_spmv(mode, a); - // cover a variety of controls - using Ctrls = KokkosKernels::Experimental::Controls; - std::vector ctrls = {OptCtrls(), // no controls - OptCtrls(Ctrls()), // empty controls - OptCtrls(Ctrls({{"algorithm", "tpl"}})), - OptCtrls(Ctrls({{"algorithm", "v4.1"}}))}; + using handle_t = SPMVHandle; + // cover a variety of algorithms + std::vector handles; + for (SPMVAlgorithm algo : {SPMV_DEFAULT, SPMV_NATIVE, SPMV_BSR_V41}) + handles.push_back(new handle_t(algo)); + + // Tensor core algorithm temporarily disabled, fails on V100 + /* if constexpr (KokkosKernels::Impl::kk_is_gpu_exec_space()) { #if defined(KOKKOS_ENABLE_CUDA) if constexpr (std::is_same_v) { #if defined(KOKKOS_ARCH_AMPERE) || defined(KOKKOS_ARCH_VOLTA) - ctrls.push_back(OptCtrls(Ctrls({{"algorithm", "experimental_tc"}}))); + handles.push_back(new handle_t(SPMV_BSR_TC)); #if defined(KOKKOS_ARCH_AMPERE) - ctrls.push_back(OptCtrls(Ctrls( - {{"algorithm", "experimental_tc"}, {"tc_precision", "double"}}))); + // Also call SPMV_BSR_TC with Precision = Double on Ampere + handles.push_back(new handle_t(SPMV_BSR_TC)); + handles.back()->bsr_tc_precision = Experimental::Bsr_TC_Precision::Double; #endif // AMPERE #endif // AMPERE || VOLTA } #endif // CUDA } + */ - for (const auto &ctrl : ctrls) { + for (handle_t *handle : handles) { for (scalar_type alpha : {scalar_type(0), scalar_type(1), scalar_type(-1), scalar_type(3.7)}) { for (scalar_type beta : {scalar_type(0), scalar_type(1), scalar_type(-1), scalar_type(-1.5)}) { - test_spmv(ctrl, mode, alpha, beta, a, acrs, maxNnzPerRow, x, y); + test_spmv(handle, mode, alpha, beta, a, acrs, 
maxNnzPerRow, x, y); } } } @@ -499,9 +472,9 @@ void test_spmv() { // Note: if mode_is_transpose(mode), then maxNnzPerRow is for A^T. Otherwise, // it's for A. -template -void test_spm_mv(const OptCtrls &controls, const char *mode, const Alpha &alpha, +template +void test_spm_mv(Handle *handle, const char *mode, const Alpha &alpha, const Beta &beta, const Bsr &a, const Crs &acrs, size_t maxNnzPerRow, const XVector &x, const YVector &y) { using scalar_type = typename Bsr::non_const_value_type; @@ -518,11 +491,7 @@ void test_spm_mv(const OptCtrls &controls, const char *mode, const Alpha &alpha, YVector yAct("yAct", y.extent(0), y.extent(1)); Kokkos::deep_copy(yAct, y); - if (controls) { - KokkosSparse::spmv(*controls, mode, alpha, a, x, beta, yAct); - } else { - KokkosSparse::spmv(mode, alpha, a, x, beta, yAct); - } + KokkosSparse::spmv(handle, mode, alpha, a, x, beta, yAct); // compare yExp and yAct auto hyExp = Kokkos::create_mirror_view(yExp); @@ -550,12 +519,8 @@ void test_spm_mv(const OptCtrls &controls, const char *mode, const Alpha &alpha, } if (!errIdx.empty()) { - std::string alg; - if (controls) { - alg = controls->getParameter("algorithm", ""); - } else { - alg = ""; - } + std::string alg = + KokkosSparse::get_spmv_algorithm_name(handle->get_algorithm()); std::cerr << __FILE__ << ":" << __LINE__ << " BsrMatrix SpMMV failure!" 
<< std::endl; @@ -621,38 +586,44 @@ auto random_multivecs_for_spm_mv(const char *mode, const Bsr &a, template void test_spm_mv_combos(const char *mode, const Bsr &a, const Crs &acrs, size_t maxNnzPerRow) { + using namespace KokkosSparse; using execution_space = typename Bsr::execution_space; using scalar_type = typename Bsr::non_const_value_type; + using multivector_t = typename MultiVectorTypeFor::type; + using handle_t = + SPMVHandle; - // cover a variety of controls - using Ctrls = KokkosKernels::Experimental::Controls; - std::vector ctrls = {OptCtrls(), // no controls - OptCtrls(Ctrls()), // empty controls - OptCtrls(Ctrls({{"algorithm", "tpl"}})), - OptCtrls(Ctrls({{"algorithm", "v4.1"}}))}; + // cover a variety of algorithms + std::vector handles; + for (SPMVAlgorithm algo : {SPMV_DEFAULT, SPMV_NATIVE, SPMV_BSR_V41}) + handles.push_back(new handle_t(algo)); + // Tensor core algorithm temporarily disabled, fails on V100 + /* if constexpr (KokkosKernels::Impl::kk_is_gpu_exec_space()) { #if defined(KOKKOS_ENABLE_CUDA) if constexpr (std::is_same_v) { #if defined(KOKKOS_ARCH_AMPERE) || defined(KOKKOS_ARCH_VOLTA) - ctrls.push_back(OptCtrls(Ctrls({{"algorithm", "experimental_tc"}}))); + handles.push_back(new handle_t(SPMV_BSR_TC)); #if defined(KOKKOS_ARCH_AMPERE) - ctrls.push_back(OptCtrls(Ctrls( - {{"algorithm", "experimental_tc"}, {"tc_precision", "double"}}))); + // Also call SPMV_BSR_TC with Precision = Double on Ampere + handles.push_back(new handle_t(SPMV_BSR_TC)); + handles.back()->bsr_tc_precision = Experimental::Bsr_TC_Precision::Double; #endif // AMPERE #endif // AMPERE || VOLTA } #endif // CUDA } + */ for (size_t numVecs : {1, 7}) { // num multivecs auto [x, y] = random_multivecs_for_spm_mv(mode, a, numVecs); - for (const auto &ctrl : ctrls) { + for (handle_t *handle : handles) { for (scalar_type alpha : {scalar_type(0), scalar_type(1), scalar_type(-1), scalar_type(3.7)}) { for (scalar_type beta : {scalar_type(0), scalar_type(1), scalar_type(-1), 
scalar_type(-1.5)}) { - test_spm_mv(ctrl, mode, alpha, beta, a, acrs, maxNnzPerRow, x, y); + test_spm_mv(handle, mode, alpha, beta, a, acrs, maxNnzPerRow, x, y); } } } diff --git a/sparse/unit_test/Test_Sparse_sptrsv.hpp b/sparse/unit_test/Test_Sparse_sptrsv.hpp index 1a4c78e08e..b8b35bc422 100644 --- a/sparse/unit_test/Test_Sparse_sptrsv.hpp +++ b/sparse/unit_test/Test_Sparse_sptrsv.hpp @@ -38,1320 +38,808 @@ using namespace KokkosKernels; using namespace KokkosKernels::Impl; using namespace KokkosKernels::Experimental; -// #ifndef kokkos_complex_double -// #define kokkos_complex_double Kokkos::complex -// #endif -// #ifndef kokkos_complex_float -// #define kokkos_complex_float Kokkos::complex -// #endif - -typedef Kokkos::complex kokkos_complex_double; -typedef Kokkos::complex kokkos_complex_float; +using kokkos_complex_double = Kokkos::complex; +using kokkos_complex_float = Kokkos::complex; namespace Test { -#if 0 -template -void run_test_sptrsv_mtx() { - - typedef typename KokkosSparse::CrsMatrix crsmat_t; - typedef typename crsmat_t::StaticCrsGraphType graph_t; - - //typedef Kokkos::View< size_type*, device > RowMapType; - //typedef Kokkos::View< lno_t*, device > EntriesType; - typedef Kokkos::View< scalar_t*, device > ValuesType; - - // Lower tri - std::cout << "LowerTriTest Begin" << std::endl; - { - -// std::string mtx_filename = "/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/L-offshore-amd.mtx"; -// std::string mtx_filename = "/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/L-Transport-amd.mtx"; -// std::string mtx_filename = "/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/L-Fault_639amd.mtx"; -// std::string mtx_filename = "/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/L-thermal2-amd.mtx"; - std::string mtx_filename = 
"/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/L-dielFilterV2real-amd.mtx"; - std::cout << "Matrix file: " << mtx_filename << std::endl; - crsmat_t triMtx = KokkosKernels::Impl::read_kokkos_crst_matrix(mtx_filename.c_str()); //in_matrix - graph_t lgraph = triMtx.graph; // in_graph - - auto row_map = lgraph.row_map; - auto entries = lgraph.entries; - auto values = triMtx.values; - - const size_type nrows = lgraph.numRows(); -// const size_type nnz = triMtx.nnz(); - - scalar_t ZERO = scalar_t(0); - scalar_t ONE = scalar_t(1); - - typedef KokkosKernels::Experimental::KokkosKernelsHandle KernelHandle; - - std::cout << "UnitTest nrows = " << nrows << std::endl; - - KernelHandle kh; - bool is_lower_tri = true; - std::cout << "Create handle" << std::endl; - kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, is_lower_tri); - - std::cout << "Prepare linear system" << std::endl; - // Create known_lhs, generate rhs, then solve for lhs to compare to known_lhs - ValuesType known_lhs("known_lhs", nrows); - // Create known solution lhs set to all 1's - Kokkos::deep_copy(known_lhs, ONE); - - // Solution to find - ValuesType lhs("lhs", nrows); - - // A*known_lhs generates rhs: rhs is dense, use spmv - ValuesType rhs("rhs", nrows); - -// typedef CrsMatrix crsMat_t; -// crsMat_t triMtx("triMtx", nrows, nrows, nnz, values, row_map, entries); - - std::cout << "SPMV" << std::endl; - KokkosSparse::spmv( "N", ONE, triMtx, known_lhs, ZERO, rhs); - - std::cout << "TriSolve Symbolic" << std::endl; - Kokkos::Timer timer; - sptrsv_symbolic( &kh, row_map, entries ); - std::cout << "LTRI Symbolic Time: " << timer.seconds() << std::endl; +template +struct SptrsvTest { + // Define useful types + using RowMapType = Kokkos::View; + using EntriesType = Kokkos::View; + using ValuesType = Kokkos::View; + using RowMapType_hostmirror = typename RowMapType::HostMirror; + using EntriesType_hostmirror = typename EntriesType::HostMirror; + using 
ValuesType_hostmirror = typename ValuesType::HostMirror; + using execution_space = typename device::execution_space; + using memory_space = typename device::memory_space; + using KernelHandle = KokkosKernels::Experimental::KokkosKernelsHandle< + size_type, lno_t, scalar_t, execution_space, memory_space, memory_space>; - std::cout << "TriSolve Solve" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); - timer.reset(); - sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); - std::cout << "LTRI Solve TEAMPOLICY! Time: " << timer.seconds() << std::endl; + using Crs = CrsMatrix; + using Bsr = BsrMatrix; - scalar_t sum = 0.0; - Kokkos::parallel_reduce( Kokkos::RangePolicy(0, lhs.extent(0)), KOKKOS_LAMBDA ( const lno_t i, scalar_t &tsum ) { - tsum += lhs(i); - }, sum); - if ( sum != lhs.extent(0) ) { - std::cout << "Lower Tri Solve FAILURE" << std::endl; - } - else { - std::cout << "Lower Tri Solve SUCCESS!" << std::endl; - //std::cout << "Num-levels = " << kh->get_sptrsv_handle()->get_num_levels() << std::endl; - } - EXPECT_TRUE( sum == scalar_t(lhs.extent(0)) ); - - Kokkos::deep_copy(lhs, 0); - kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHD_RP); - kh.get_sptrsv_handle()->print_algorithm(); - timer.reset(); - sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); - std::cout << "LTRI Solve SEQLVLSCHD_RP Time: " << timer.seconds() << std::endl; - - sum = 0.0; - Kokkos::parallel_reduce( Kokkos::RangePolicy(0, lhs.extent(0)), KOKKOS_LAMBDA ( const lno_t i, scalar_t &tsum ) { - tsum += lhs(i); - }, sum); - if ( sum != lhs.extent(0) ) { - std::cout << "Lower Tri Solve FAILURE" << std::endl; - } - else { - std::cout << "Lower Tri Solve SUCCESS!" 
<< std::endl; - //std::cout << "Num-levels = " << kh->get_sptrsv_handle()->get_num_levels() << std::endl; - } - EXPECT_TRUE( sum == scalar_t(lhs.extent(0)) ); - - Kokkos::deep_copy(lhs, 0); - kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHED_TP2); - kh.get_sptrsv_handle()->print_algorithm(); - timer.reset(); - sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); - std::cout << "LTRI Solve SEQLVLSCHED_TP2 Time: " << timer.seconds() << std::endl; - - sum = 0.0; - Kokkos::parallel_reduce( Kokkos::RangePolicy(0, lhs.extent(0)), KOKKOS_LAMBDA ( const lno_t i, scalar_t &tsum ) { - tsum += lhs(i); - }, sum); - if ( sum != lhs.extent(0) ) { - std::cout << "Lower Tri Solve FAILURE" << std::endl; - } - else { - std::cout << "Lower Tri Solve SUCCESS!" << std::endl; - //std::cout << "Num-levels = " << kh->get_sptrsv_handle()->get_num_levels() << std::endl; - } - EXPECT_TRUE( sum == scalar_t(lhs.extent(0)) ); + using crs_graph_t = typename Crs::StaticCrsGraphType; + using range_policy_t = Kokkos::RangePolicy; - kh.destroy_sptrsv_handle(); + static std::vector> get_5x5_ut_ones_fixture() { + std::vector> A = {{1.00, 0.00, 1.00, 0.00, 0.00}, + {0.00, 1.00, 0.00, 0.00, 1.00}, + {0.00, 0.00, 1.00, 1.00, 1.00}, + {0.00, 0.00, 0.00, 1.00, 1.00}, + {0.00, 0.00, 0.00, 0.00, 1.00}}; + return A; } - // Upper tri - std::cout << "UpperTriTest Begin" << std::endl; - { -// std::string mtx_filename = "/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/U-offshore-amd.mtx"; -// std::string mtx_filename = "/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/U-Transport-amd.mtx"; -// std::string mtx_filename = "/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/U-Fault_639amd.mtx"; -// std::string mtx_filename = "/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/U-thermal2-amd.mtx"; - std::string mtx_filename = 
"/ascldap/users/ndellin/TestCodes-GitlabEx/KokkosEcoCodes/KokkosKernels-DevTests/Matrices/U-dielFilterV2real-amd.mtx"; - std::cout << "Matrix file: " << mtx_filename << std::endl; - crsmat_t triMtx = KokkosKernels::Impl::read_kokkos_crst_matrix(mtx_filename.c_str()); //in_matrix - graph_t lgraph = triMtx.graph; // in_graph - - auto row_map = lgraph.row_map; - auto entries = lgraph.entries; - auto values = triMtx.values; - - const size_type nrows = lgraph.numRows(); -// const size_type nnz = triMtx.nnz(); - - scalar_t ZERO = scalar_t(0); - scalar_t ONE = scalar_t(1); - - typedef KokkosKernels::Experimental::KokkosKernelsHandle KernelHandle; - - std::cout << "UnitTest nrows = " << nrows << std::endl; - - KernelHandle kh; - bool is_lower_tri = false; - std::cout << "Create handle" << std::endl; - kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, is_lower_tri); - - std::cout << "Prepare linear system" << std::endl; - // Create known_lhs, generate rhs, then solve for lhs to compare to known_lhs - ValuesType known_lhs("known_lhs", nrows); - // Create known solution lhs set to all 1's - Kokkos::deep_copy(known_lhs, ONE); - - // Solution to find - ValuesType lhs("lhs", nrows); - - // A*known_lhs generates rhs: rhs is dense, use spmv - ValuesType rhs("rhs", nrows); - -// typedef CrsMatrix crsMat_t; -// crsMat_t triMtx("triMtx", nrows, nrows, nnz, values, row_map, entries); - std::cout << "SPMV" << std::endl; - KokkosSparse::spmv( "N", ONE, triMtx, known_lhs, ZERO, rhs); - - std::cout << "TriSolve Symbolic" << std::endl; - Kokkos::Timer timer; - sptrsv_symbolic( &kh, row_map, entries ); - std::cout << "UTRI Symbolic Time: " << timer.seconds() << std::endl; - - std::cout << "TriSolve Solve" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); - timer.reset(); - sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); - std::cout << "UTRI Solve SEQLVLSCHD_TP1 Time: " << timer.seconds() << std::endl; - - scalar_t sum = 0.0; - Kokkos::parallel_reduce( 
Kokkos::RangePolicy(0, lhs.extent(0)), KOKKOS_LAMBDA ( const lno_t i, scalar_t &tsum ) { - tsum += lhs(i); - }, sum); - if ( sum != lhs.extent(0) ) { - std::cout << "Upper Tri Solve FAILURE" << std::endl; - } - else { - std::cout << "Upper Tri Solve SUCCESS!" << std::endl; - //std::cout << "Num-levels = " << kh->get_sptrsv_handle()->get_num_levels() << std::endl; - } - EXPECT_TRUE( sum == scalar_t(lhs.extent(0)) ); - - Kokkos::deep_copy(lhs, 0); - kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHD_RP); - kh.get_sptrsv_handle()->print_algorithm(); - timer.reset(); - sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); - std::cout << "UTRI Solve SEQLVLSCHD_RP Time: " << timer.seconds() << std::endl; - - sum = 0.0; - Kokkos::parallel_reduce( Kokkos::RangePolicy(0, lhs.extent(0)), KOKKOS_LAMBDA ( const lno_t i, scalar_t &tsum ) { - tsum += lhs(i); - }, sum); - if ( sum != lhs.extent(0) ) { - std::cout << "Upper Tri Solve FAILURE" << std::endl; - } - else { - std::cout << "Upper Tri Solve SUCCESS!" << std::endl; - //std::cout << "Num-levels = " << kh->get_sptrsv_handle()->get_num_levels() << std::endl; - } - EXPECT_TRUE( sum == scalar_t(lhs.extent(0)) ); - - Kokkos::deep_copy(lhs, 0); - kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHED_TP2); - kh.get_sptrsv_handle()->print_algorithm(); - timer.reset(); - sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); - std::cout << "UTRI Solve SEQLVLSCHED_TP2 Time: " << timer.seconds() << std::endl; - - sum = 0.0; - Kokkos::parallel_reduce( Kokkos::RangePolicy(0, lhs.extent(0)), KOKKOS_LAMBDA ( const lno_t i, scalar_t &tsum ) { - tsum += lhs(i); - }, sum); - if ( sum != lhs.extent(0) ) { - std::cout << "Upper Tri Solve FAILURE" << std::endl; - } - else { - std::cout << "Upper Tri Solve SUCCESS!" 
<< std::endl; - //std::cout << "Num-levels = " << kh->get_sptrsv_handle()->get_num_levels() << std::endl; - } - EXPECT_TRUE( sum == scalar_t(lhs.extent(0)) ); - - kh.destroy_sptrsv_handle(); + static std::vector> get_5x5_ut_fixture() { + const auto KZ = KEEP_ZERO(); + std::vector> A = {{5.00, 1.00, 1.00, 0.00, KZ}, + {KZ, 5.00, KZ, 0.00, 1.00}, + {0.00, 0.00, 5.00, 1.00, 1.00}, + {0.00, 0.00, 0.00, 5.00, 1.00}, + {0.00, 0.00, 0.00, 0.00, 5.00}}; + return A; } -} -#endif - -namespace { -template -struct ReductionCheck { - using lno_t = OrdinalType; - using value_type = ValueType; - - ViewType lhs; + static std::vector> get_5x5_lt_fixture() { + const auto KZ = KEEP_ZERO(); + std::vector> A = {{5.00, KZ, 0.00, 0.00, 0.00}, + {2.00, 5.00, 0.00, 0.00, 0.00}, + {1.00, KZ, 5.00, 0.00, 0.00}, + {0.00, 0.00, 1.00, 5.00, 0.00}, + {KZ, 1.00, 1.00, 1.00, 5.00}}; + return A; + } - ReductionCheck(const ViewType &lhs_) : lhs(lhs_) {} + static std::vector> get_5x5_lt_ones_fixture() { + std::vector> A = {{1.00, 0.00, 0.00, 0.00, 0.00}, + {0.00, 1.00, 0.00, 0.00, 0.00}, + {1.00, 0.00, 1.00, 0.00, 0.00}, + {0.00, 0.00, 1.00, 1.00, 0.00}, + {0.00, 1.00, 1.00, 1.00, 1.00}}; + return A; + } - KOKKOS_INLINE_FUNCTION - void operator()(lno_t i, value_type &tsum) const { tsum += lhs(i); } -}; -} // namespace + struct ReductionCheck { + ValuesType lhs; -template -void run_test_sptrsv() { - typedef Kokkos::View RowMapType; - typedef Kokkos::View EntriesType; - typedef Kokkos::View ValuesType; + ReductionCheck(const ValuesType &lhs_) : lhs(lhs_) {} - scalar_t ZERO = scalar_t(0); - scalar_t ONE = scalar_t(1); + KOKKOS_INLINE_FUNCTION + void operator()(lno_t i, scalar_t &tsum) const { tsum += lhs(i); } + }; - const size_type nrows = 5; - const size_type nnz = 10; + static void run_test_sptrsv() { + scalar_t ZERO = scalar_t(0); + scalar_t ONE = scalar_t(1); - using KernelHandle = KokkosKernels::Experimental::KokkosKernelsHandle< - size_type, lno_t, scalar_t, typename device::execution_space, - 
typename device::memory_space, typename device::memory_space>; + const size_type nrows = 5; + const size_type nnz = 10; #if defined(KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV) - using host_crsmat_t = typename KernelHandle::SPTRSVHandleType::host_crsmat_t; - using host_graph_t = typename host_crsmat_t::StaticCrsGraphType; + using host_crsmat_t = + typename KernelHandle::SPTRSVHandleType::host_crsmat_t; + using host_graph_t = typename host_crsmat_t::StaticCrsGraphType; - using row_map_view_t = typename host_graph_t::row_map_type::non_const_type; - using cols_view_t = typename host_graph_t::entries_type::non_const_type; - using values_view_t = typename host_crsmat_t::values_type::non_const_type; + using row_map_view_t = typename host_graph_t::row_map_type::non_const_type; + using cols_view_t = typename host_graph_t::entries_type::non_const_type; + using values_view_t = typename host_crsmat_t::values_type::non_const_type; - // L & U handle for supernodal SpTrsv - KernelHandle khL; - KernelHandle khU; + // L & U handle for supernodal SpTrsv + KernelHandle khL; + KernelHandle khU; - // right-hand-side and solution - ValuesType B("rhs", nrows); - ValuesType X("sol", nrows); + // right-hand-side and solution + ValuesType B("rhs", nrows); + ValuesType X("sol", nrows); - // host CRS for L & U - host_crsmat_t L, U, Ut; + // host CRS for L & U + host_crsmat_t L, U, Ut; #endif - // Upper tri - { - RowMapType row_map("row_map", nrows + 1); - EntriesType entries("entries", nnz); - ValuesType values("values", nnz); - - auto hrow_map = Kokkos::create_mirror_view(row_map); - auto hentries = Kokkos::create_mirror_view(entries); - auto hvalues = Kokkos::create_mirror_view(values); - - hrow_map(0) = 0; - hrow_map(1) = 2; - hrow_map(2) = 4; - hrow_map(3) = 7; - hrow_map(4) = 9; - hrow_map(5) = 10; - - hentries(0) = 0; - hentries(1) = 2; - hentries(2) = 1; - hentries(3) = 4; - hentries(4) = 2; - hentries(5) = 3; - hentries(6) = 4; - hentries(7) = 3; - hentries(8) = 4; - hentries(9) = 4; - - 
for (size_type i = 0; i < nnz; ++i) { - hvalues(i) = ONE; - } - - Kokkos::deep_copy(row_map, hrow_map); - Kokkos::deep_copy(entries, hentries); - Kokkos::deep_copy(values, hvalues); + // Upper tri + { + RowMapType row_map; + EntriesType entries; + ValuesType values; - // Create known_lhs, generate rhs, then solve for lhs to compare to - // known_lhs - ValuesType known_lhs("known_lhs", nrows); - // Create known solution lhs set to all 1's - Kokkos::deep_copy(known_lhs, ONE); + auto fixture = get_5x5_ut_ones_fixture(); - // Solution to find - ValuesType lhs("lhs", nrows); + compress_matrix(row_map, entries, values, fixture); - // A*known_lhs generates rhs: rhs is dense, use spmv - ValuesType rhs("rhs", nrows); + // Create known_lhs, generate rhs, then solve for lhs to compare to + // known_lhs + ValuesType known_lhs("known_lhs", nrows); + // Create known solution lhs set to all 1's + Kokkos::deep_copy(known_lhs, ONE); - typedef CrsMatrix crsMat_t; - crsMat_t triMtx("triMtx", nrows, nrows, nnz, values, row_map, entries); - KokkosSparse::spmv("N", ONE, triMtx, known_lhs, ZERO, rhs); + // Solution to find + ValuesType lhs("lhs", nrows); - { - KernelHandle kh; - bool is_lower_tri = false; - kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, - is_lower_tri); - - sptrsv_symbolic(&kh, row_map, entries); - Kokkos::fence(); - - sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); - Kokkos::fence(); - - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, - lhs.extent(0)), - ReductionCheck(lhs), sum); - if (sum != lhs.extent(0)) { - std::cout << "Upper Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); - } - EXPECT_TRUE(sum == scalar_t(lhs.extent(0))); - - Kokkos::deep_copy(lhs, ZERO); - kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHD_RP); - sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); - Kokkos::fence(); - - sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, - lhs.extent(0)), - 
ReductionCheck(lhs), sum); - if (sum != lhs.extent(0)) { - std::cout << "Upper Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); - } - EXPECT_TRUE(sum == scalar_t(lhs.extent(0))); - - // FIXME Issues with various integral type combos - algorithm currently - // unavailable and commented out until fixed - /* - Kokkos::deep_copy(lhs, ZERO); - kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHED_TP2); - sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); - Kokkos::fence(); - - sum = 0.0; - Kokkos::parallel_reduce( Kokkos::RangePolicy(0, lhs.extent(0)), ReductionCheck(lhs), sum); if ( sum != lhs.extent(0) ) { std::cout << - "Upper Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); - } - EXPECT_TRUE( sum == scalar_t(lhs.extent(0)) ); - */ + // A*known_lhs generates rhs: rhs is dense, use spmv + ValuesType rhs("rhs", nrows); - kh.destroy_sptrsv_handle(); - } + Crs triMtx("triMtx", nrows, nrows, nnz, values, row_map, entries); + KokkosSparse::spmv("N", ONE, triMtx, known_lhs, ZERO, rhs); - { - Kokkos::deep_copy(lhs, ZERO); - KernelHandle kh; - bool is_lower_tri = false; - kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1CHAIN, nrows, - is_lower_tri); - auto chain_threshold = 1; - kh.get_sptrsv_handle()->reset_chain_threshold(chain_threshold); - - sptrsv_symbolic(&kh, row_map, entries); - Kokkos::fence(); - - sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); - Kokkos::fence(); - - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, - lhs.extent(0)), - ReductionCheck(lhs), sum); - if (sum != lhs.extent(0)) { - std::cout << "Upper Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); + { + KernelHandle kh; + bool is_lower_tri = false; + kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, + is_lower_tri); + + sptrsv_symbolic(&kh, row_map, entries); + Kokkos::fence(); + + sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); + 
Kokkos::fence(); + + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0)); + + Kokkos::deep_copy(lhs, ZERO); + kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHD_RP); + sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); + Kokkos::fence(); + + sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0)); + + // FIXME Issues with various integral type combos - algorithm currently + // unavailable and commented out until fixed + /* + Kokkos::deep_copy(lhs, ZERO); + kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHED_TP2); + sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); + Kokkos::fence(); + + sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0) ); + */ + + kh.destroy_sptrsv_handle(); } - EXPECT_TRUE(sum == scalar_t(lhs.extent(0))); - kh.destroy_sptrsv_handle(); - } + { + Kokkos::deep_copy(lhs, ZERO); + KernelHandle kh; + bool is_lower_tri = false; + kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1CHAIN, nrows, + is_lower_tri); + auto chain_threshold = 1; + kh.get_sptrsv_handle()->reset_chain_threshold(chain_threshold); + + sptrsv_symbolic(&kh, row_map, entries); + Kokkos::fence(); + + sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); + Kokkos::fence(); + + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0)); + + kh.destroy_sptrsv_handle(); + } #ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE - if (std::is_same::value && - std::is_same::value && - std::is_same::value) { - Kokkos::deep_copy(lhs, ZERO); - KernelHandle kh; - bool is_lower_tri = false; - kh.create_sptrsv_handle(SPTRSVAlgorithm::SPTRSV_CUSPARSE, nrows, - is_lower_tri); - - sptrsv_symbolic(&kh, row_map, entries, values); - Kokkos::fence(); - 
- sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); - Kokkos::fence(); - - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, - lhs.extent(0)), - ReductionCheck(lhs), sum); - if (sum != lhs.extent(0)) { - std::cout << "Upper Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); + if (std::is_same::value && + std::is_same::value && + std::is_same::value) { + Kokkos::deep_copy(lhs, ZERO); + KernelHandle kh; + bool is_lower_tri = false; + kh.create_sptrsv_handle(SPTRSVAlgorithm::SPTRSV_CUSPARSE, nrows, + is_lower_tri); + + sptrsv_symbolic(&kh, row_map, entries, values); + Kokkos::fence(); + + sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); + Kokkos::fence(); + + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0)); + + kh.destroy_sptrsv_handle(); } - EXPECT_TRUE(sum == scalar_t(lhs.extent(0))); - - kh.destroy_sptrsv_handle(); - } #endif #if defined(KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV) - const scalar_t FIVE = scalar_t(5); - const size_type nnz_sp = 14; - { - // U in csr - row_map_view_t hUrowptr("hUrowptr", nrows + 1); - cols_view_t hUcolind("hUcolind", nnz_sp); - values_view_t hUvalues("hUvalues", nnz_sp); - - // rowptr - hUrowptr(0) = 0; - hUrowptr(1) = 4; - hUrowptr(2) = 8; - hUrowptr(3) = 11; - hUrowptr(4) = 13; - hUrowptr(5) = 14; - - // colind - // first row (first supernode) - hUcolind(0) = 0; - hUcolind(1) = 1; - hUcolind(2) = 2; - hUcolind(3) = 4; - // second row (first supernode) - hUcolind(4) = 0; - hUcolind(5) = 1; - hUcolind(6) = 2; - hUcolind(7) = 4; - // third row (second supernode) - hUcolind(8) = 2; - hUcolind(9) = 3; - hUcolind(10) = 4; - // fourth row (third supernode) - hUcolind(11) = 3; - hUcolind(12) = 4; - // fifth row (fourth supernode) - hUcolind(13) = 4; - - // values - // first row (first supernode) - hUvalues(0) = FIVE; - hUvalues(1) = ONE; - hUvalues(2) = ONE; - hUvalues(3) = ZERO; - // second 
row (first supernode) - hUvalues(4) = ZERO; - hUvalues(5) = FIVE; - hUvalues(6) = ZERO; - hUvalues(7) = ONE; - // third row (second supernode) - hUvalues(8) = FIVE; - hUvalues(9) = ONE; - hUvalues(10) = ONE; - // fourth row (third supernode) - hUvalues(11) = FIVE; - hUvalues(12) = ONE; - // fifth row (fourth supernode) - hUvalues(13) = FIVE; - - // save U for Supernodal Sptrsv - host_graph_t static_graph(hUcolind, hUrowptr); - U = host_crsmat_t("CrsMatrixU", nrows, hUvalues, static_graph); - - // create handle for Supernodal Sptrsv - bool is_lower_tri = false; - khU.create_sptrsv_handle(SPTRSVAlgorithm::SUPERNODAL_DAG, nrows, - is_lower_tri); - - // X = U*ONES to generate B = A*ONES (on device) + const scalar_t FIVE = scalar_t(5); + const size_type nnz_sp = 14; { - RowMapType Urowptr("Urowptr", nrows + 1); - EntriesType Ucolind("Ucolind", nnz_sp); - ValuesType Uvalues("Uvalues", nnz_sp); - - Kokkos::deep_copy(Urowptr, hUrowptr); - Kokkos::deep_copy(Ucolind, hUcolind); - Kokkos::deep_copy(Uvalues, hUvalues); + // U in csr + auto ut_fixture = get_5x5_ut_fixture(); + row_map_view_t hUrowptr; + cols_view_t hUcolind; + values_view_t hUvalues; + + // first row -> first supernode + // second row -> first supernode + // third row -> second supernode + // fourth row -> third supernode + // fifth row -> fourth supernode + + compress_matrix(hUrowptr, hUcolind, hUvalues, ut_fixture); + + // save U for Supernodal Sptrsv + host_graph_t static_graph(hUcolind, hUrowptr); + U = host_crsmat_t("CrsMatrixU", nrows, hUvalues, static_graph); + + // create handle for Supernodal Sptrsv + bool is_lower_tri = false; + khU.create_sptrsv_handle(SPTRSVAlgorithm::SUPERNODAL_DAG, nrows, + is_lower_tri); + + // X = U*ONES to generate B = A*ONES (on device) + { + RowMapType Urowptr("Urowptr", nrows + 1); + EntriesType Ucolind("Ucolind", nnz_sp); + ValuesType Uvalues("Uvalues", nnz_sp); + + Kokkos::deep_copy(Urowptr, hUrowptr); + Kokkos::deep_copy(Ucolind, hUcolind); + Kokkos::deep_copy(Uvalues, 
hUvalues); + + Crs mtxU("mtxU", nrows, nrows, nnz_sp, Uvalues, Urowptr, Ucolind); + Kokkos::deep_copy(B, ONE); + KokkosSparse::spmv("N", ONE, mtxU, B, ZERO, X); + } + } - crsMat_t mtxU("mtxU", nrows, nrows, nnz_sp, Uvalues, Urowptr, Ucolind); - Kokkos::deep_copy(B, ONE); - KokkosSparse::spmv("N", ONE, mtxU, B, ZERO, X); + { + // U in csc (for inverting off-diag) + row_map_view_t hUcolptr("hUcolptr", nrows + 1); + cols_view_t hUrowind("hUrowind", nnz_sp); + values_view_t hUvalues("hUvalues", nnz_sp); + + // The unsorted ordering seems to matter here, so we cannot use our + // fixture tools. + + hUcolptr(0) = 0; + hUcolptr(1) = 2; + hUcolptr(2) = 4; + hUcolptr(3) = 7; + hUcolptr(4) = 9; + hUcolptr(5) = 14; + + // colind + // first column (first supernode) + hUrowind(0) = 0; + hUrowind(1) = 1; + // second column (first supernode) + hUrowind(2) = 0; + hUrowind(3) = 1; + // third column (second supernode) + hUrowind(4) = 2; + hUrowind(5) = 0; + hUrowind(6) = 1; + // fourth column (third supernode) + hUrowind(7) = 3; + hUrowind(8) = 2; + // fifth column (fourth supernode) + hUrowind(9) = 4; + hUrowind(10) = 0; + hUrowind(11) = 1; + hUrowind(12) = 2; + hUrowind(13) = 3; + + // values + // first column (first supernode) + hUvalues(0) = FIVE; + hUvalues(1) = ZERO; + // second column (first supernode) + hUvalues(2) = ONE; + hUvalues(3) = FIVE; + // third column (second supernode) + hUvalues(4) = FIVE; + hUvalues(5) = ONE; + hUvalues(6) = ZERO; + // fourth column (third supernode) + hUvalues(7) = FIVE; + hUvalues(8) = ONE; + // fifth column (fourth supernode) + hUvalues(9) = FIVE; + hUvalues(10) = ZERO; + hUvalues(11) = ONE; + hUvalues(12) = ONE; + hUvalues(13) = ONE; + + // store Ut in crsmat + host_graph_t static_graph(hUrowind, hUcolptr); + Ut = host_crsmat_t("CrsMatrixUt", nrows, hUvalues, static_graph); } +#endif } + // Lower tri { - // U in csc (for inverting off-diag) - row_map_view_t hUcolptr("hUcolptr", nrows + 1); - cols_view_t hUrowind("hUrowind", nnz_sp); - 
values_view_t hUvalues("hUvalues", nnz_sp); - - // colptr - hUcolptr(0) = 0; - hUcolptr(1) = 2; - hUcolptr(2) = 4; - hUcolptr(3) = 7; - hUcolptr(4) = 9; - hUcolptr(5) = 14; - - // colind - // first column (first supernode) - hUrowind(0) = 0; - hUrowind(1) = 1; - // second column (first supernode) - hUrowind(2) = 0; - hUrowind(3) = 1; - // third column (second supernode) - hUrowind(4) = 2; - hUrowind(5) = 0; - hUrowind(6) = 1; - // fourth column (third supernode) - hUrowind(7) = 3; - hUrowind(8) = 2; - // fifth column (fourth supernode) - hUrowind(9) = 4; - hUrowind(10) = 0; - hUrowind(11) = 1; - hUrowind(12) = 2; - hUrowind(13) = 3; - - // values - // first column (first supernode) - hUvalues(0) = FIVE; - hUvalues(1) = ZERO; - // second column (first supernode) - hUvalues(2) = ONE; - hUvalues(3) = FIVE; - // third column (second supernode) - hUvalues(4) = FIVE; - hUvalues(5) = ONE; - hUvalues(6) = ZERO; - // fourth column (third supernode) - hUvalues(7) = FIVE; - hUvalues(8) = ONE; - // fifth column (fourth supernode) - hUvalues(9) = FIVE; - hUvalues(10) = ZERO; - hUvalues(11) = ONE; - hUvalues(12) = ONE; - hUvalues(13) = ONE; - - // store Ut in crsmat - host_graph_t static_graph(hUrowind, hUcolptr); - Ut = host_crsmat_t("CrsMatrixUt", nrows, hUvalues, static_graph); - } -#endif - } - - // Lower tri - { - RowMapType row_map("row_map", nrows + 1); - EntriesType entries("entries", nnz); - ValuesType values("values", nnz); - - auto hrow_map = Kokkos::create_mirror_view(row_map); - auto hentries = Kokkos::create_mirror_view(entries); - auto hvalues = Kokkos::create_mirror_view(values); - - hrow_map(0) = 0; - hrow_map(1) = 1; - hrow_map(2) = 2; - hrow_map(3) = 4; - hrow_map(4) = 6; - hrow_map(5) = 10; - - hentries(0) = 0; - hentries(1) = 1; - hentries(2) = 0; - hentries(3) = 2; - hentries(4) = 2; - hentries(5) = 3; - hentries(6) = 1; - hentries(7) = 2; - hentries(8) = 3; - hentries(9) = 4; - - for (size_type i = 0; i < nnz; ++i) { - hvalues(i) = ONE; - } + auto fixture 
= get_5x5_lt_ones_fixture(); + RowMapType row_map; + EntriesType entries; + ValuesType values; - Kokkos::deep_copy(row_map, hrow_map); - Kokkos::deep_copy(entries, hentries); - Kokkos::deep_copy(values, hvalues); + compress_matrix(row_map, entries, values, fixture); - // Create known_lhs, generate rhs, then solve for lhs to compare to - // known_lhs - ValuesType known_lhs("known_lhs", nrows); - // Create known solution lhs set to all 1's - Kokkos::deep_copy(known_lhs, ONE); + // Create known_lhs, generate rhs, then solve for lhs to compare to + // known_lhs + ValuesType known_lhs("known_lhs", nrows); + // Create known solution lhs set to all 1's + Kokkos::deep_copy(known_lhs, ONE); - // Solution to find - ValuesType lhs("lhs", nrows); + // Solution to find + ValuesType lhs("lhs", nrows); - // A*known_lhs generates rhs: rhs is dense, use spmv - ValuesType rhs("rhs", nrows); + // A*known_lhs generates rhs: rhs is dense, use spmv + ValuesType rhs("rhs", nrows); - typedef CrsMatrix crsMat_t; - crsMat_t triMtx("triMtx", nrows, nrows, nnz, values, row_map, entries); - KokkosSparse::spmv("N", ONE, triMtx, known_lhs, ZERO, rhs); + Crs triMtx("triMtx", nrows, nrows, nnz, values, row_map, entries); + KokkosSparse::spmv("N", ONE, triMtx, known_lhs, ZERO, rhs); - { - KernelHandle kh; - bool is_lower_tri = true; - kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, - is_lower_tri); - - sptrsv_symbolic(&kh, row_map, entries); - Kokkos::fence(); - - sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); - Kokkos::fence(); - - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, - lhs.extent(0)), - ReductionCheck(lhs), sum); - if (sum != lhs.extent(0)) { - std::cout << "Lower Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); - } - EXPECT_TRUE(sum == scalar_t(lhs.extent(0))); - - Kokkos::deep_copy(lhs, ZERO); - kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHD_RP); - sptrsv_solve(&kh, row_map, entries, 
values, rhs, lhs); - Kokkos::fence(); - - sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, - lhs.extent(0)), - ReductionCheck(lhs), sum); - if (sum != lhs.extent(0)) { - std::cout << "Lower Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); - } - EXPECT_TRUE(sum == scalar_t(lhs.extent(0))); - - // FIXME Issues with various integral type combos - algorithm currently - // unavailable and commented out until fixed - /* - Kokkos::deep_copy(lhs, ZERO); - kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHED_TP2); - sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); - Kokkos::fence(); - - sum = 0.0; - Kokkos::parallel_reduce( Kokkos::RangePolicy(0, lhs.extent(0)), ReductionCheck(lhs), sum); if ( sum != lhs.extent(0) ) { std::cout << - "Lower Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); + { + KernelHandle kh; + bool is_lower_tri = true; + kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, + is_lower_tri); + + sptrsv_symbolic(&kh, row_map, entries); + Kokkos::fence(); + + sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); + Kokkos::fence(); + + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0)); + + Kokkos::deep_copy(lhs, ZERO); + kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHD_RP); + sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); + Kokkos::fence(); + + sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0)); + + // FIXME Issues with various integral type combos - algorithm currently + // unavailable and commented out until fixed + /* + Kokkos::deep_copy(lhs, ZERO); + kh.get_sptrsv_handle()->set_algorithm(SPTRSVAlgorithm::SEQLVLSCHED_TP2); + sptrsv_solve( &kh, row_map, entries, values, rhs, lhs ); + Kokkos::fence(); + + sum = 0.0; + Kokkos::parallel_reduce( range_policy_t(0, 
lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ( sum, lhs.extent(0) ); + */ + + kh.destroy_sptrsv_handle(); } - EXPECT_TRUE( sum == scalar_t(lhs.extent(0)) ); - */ - kh.destroy_sptrsv_handle(); - } - - { - Kokkos::deep_copy(lhs, ZERO); - KernelHandle kh; - bool is_lower_tri = true; - kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1CHAIN, nrows, - is_lower_tri); - auto chain_threshold = 1; - kh.get_sptrsv_handle()->reset_chain_threshold(chain_threshold); - - sptrsv_symbolic(&kh, row_map, entries); - Kokkos::fence(); - - sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); - Kokkos::fence(); - - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, - lhs.extent(0)), - ReductionCheck(lhs), sum); - if (sum != lhs.extent(0)) { - std::cout << "Lower Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); + { + Kokkos::deep_copy(lhs, ZERO); + KernelHandle kh; + bool is_lower_tri = true; + kh.create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1CHAIN, nrows, + is_lower_tri); + auto chain_threshold = 1; + kh.get_sptrsv_handle()->reset_chain_threshold(chain_threshold); + + sptrsv_symbolic(&kh, row_map, entries); + Kokkos::fence(); + + sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); + Kokkos::fence(); + + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0)); + + kh.destroy_sptrsv_handle(); } - EXPECT_TRUE(sum == scalar_t(lhs.extent(0))); - - kh.destroy_sptrsv_handle(); - } #ifdef KOKKOSKERNELS_ENABLE_TPL_CUSPARSE - if (std::is_same::value && - std::is_same::value && - std::is_same::value) { - Kokkos::deep_copy(lhs, ZERO); - KernelHandle kh; - bool is_lower_tri = true; - kh.create_sptrsv_handle(SPTRSVAlgorithm::SPTRSV_CUSPARSE, nrows, - is_lower_tri); - - sptrsv_symbolic(&kh, row_map, entries, values); - Kokkos::fence(); - - sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); - Kokkos::fence(); - - scalar_t sum = 0.0; - 
Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, - lhs.extent(0)), - ReductionCheck(lhs), sum); - if (sum != lhs.extent(0)) { - std::cout << "Lower Tri Solve FAILURE" << std::endl; - kh.get_sptrsv_handle()->print_algorithm(); + if (std::is_same::value && + std::is_same::value && + std::is_same::value) { + Kokkos::deep_copy(lhs, ZERO); + KernelHandle kh; + bool is_lower_tri = true; + kh.create_sptrsv_handle(SPTRSVAlgorithm::SPTRSV_CUSPARSE, nrows, + is_lower_tri); + + sptrsv_symbolic(&kh, row_map, entries, values); + Kokkos::fence(); + + sptrsv_solve(&kh, row_map, entries, values, rhs, lhs); + Kokkos::fence(); + + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs.extent(0)), + ReductionCheck(lhs), sum); + EXPECT_EQ(sum, lhs.extent(0)); + + kh.destroy_sptrsv_handle(); } - EXPECT_TRUE(sum == scalar_t(lhs.extent(0))); - - kh.destroy_sptrsv_handle(); - } #endif #if defined(KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV) - { - // L in csc - const scalar_t TWO = scalar_t(2); - const scalar_t FIVE = scalar_t(5); - const size_type nnz_sp = 14; - - row_map_view_t hLcolptr("hUcolptr", nrows + 1); - cols_view_t hLrowind("hUrowind", nnz_sp); - values_view_t hLvalues("hUvalues", nnz_sp); - - // colptr - hLcolptr(0) = 0; - hLcolptr(1) = 4; - hLcolptr(2) = 8; - hLcolptr(3) = 11; - hLcolptr(4) = 13; - hLcolptr(5) = 14; - - // rowind - // first column (first supernode) - hLrowind(0) = 0; - hLrowind(1) = 1; - hLrowind(2) = 2; - hLrowind(3) = 4; - // second column (first supernode) - hLrowind(4) = 0; - hLrowind(5) = 1; - hLrowind(6) = 2; - hLrowind(7) = 4; - // third column (second supernode) - hLrowind(8) = 2; - hLrowind(9) = 3; - hLrowind(10) = 4; - // fourth column (third supernode) - hLrowind(11) = 3; - hLrowind(12) = 4; - // fifth column (fourth supernode) - hLrowind(13) = 4; - - // values - // first column (first supernode) - hLvalues(0) = FIVE; - hLvalues(1) = TWO; - hLvalues(2) = ONE; - hLvalues(3) = ZERO; - // second column (first supernode) - hLvalues(4) = ZERO; - 
hLvalues(5) = FIVE; - hLvalues(6) = ZERO; - hLvalues(7) = ONE; - // third column (second supernode) - hLvalues(8) = FIVE; - hLvalues(9) = ONE; - hLvalues(10) = ONE; - // fourth column (third supernode) - hLvalues(11) = FIVE; - hLvalues(12) = ONE; - // fifth column (fourth supernode) - hLvalues(13) = FIVE; - - // store Lt in crsmat - host_graph_t static_graph(hLrowind, hLcolptr); - L = host_crsmat_t("CrsMatrixL", nrows, hLvalues, static_graph); - - bool is_lower_tri = true; - khL.create_sptrsv_handle(SPTRSVAlgorithm::SUPERNODAL_DAG, nrows, - is_lower_tri); - - // generate B = A*ONES = L*(U*ONES), where X = U*ONES (on device) { - RowMapType Lcolptr("Lcolptr", nrows + 1); - EntriesType Lrowind("Lrowind", nnz_sp); - ValuesType Lvalues("Lvalues", nnz_sp); - - Kokkos::deep_copy(Lcolptr, hLcolptr); - Kokkos::deep_copy(Lrowind, hLrowind); - Kokkos::deep_copy(Lvalues, hLvalues); - - crsMat_t mtxL("mtxL", nrows, nrows, nnz_sp, Lvalues, Lcolptr, Lrowind); - KokkosSparse::spmv("T", ONE, mtxL, X, ZERO, B); + // L in csc + const size_type nnz_sp = 14; + + // first column (first supernode) + // second column (first supernode) + // third column (second supernode) + // fourth column (third supernode) + // fifth column (fourth supernode) + + auto lt_fixture = get_5x5_lt_fixture(); + row_map_view_t hLcolptr; + cols_view_t hLrowind; + values_view_t hLvalues; + compress_matrix(hLcolptr, hLrowind, hLvalues, lt_fixture); + + // store Lt in crsmat + host_graph_t static_graph(hLrowind, hLcolptr); + L = host_crsmat_t("CrsMatrixL", nrows, hLvalues, static_graph); + + bool is_lower_tri = true; + khL.create_sptrsv_handle(SPTRSVAlgorithm::SUPERNODAL_DAG, nrows, + is_lower_tri); + + // generate B = A*ONES = L*(U*ONES), where X = U*ONES (on device) + { + RowMapType Lcolptr("Lcolptr", nrows + 1); + EntriesType Lrowind("Lrowind", nnz_sp); + ValuesType Lvalues("Lvalues", nnz_sp); + + Kokkos::deep_copy(Lcolptr, hLcolptr); + Kokkos::deep_copy(Lrowind, hLrowind); + Kokkos::deep_copy(Lvalues, hLvalues); 
+ + Crs mtxL("mtxL", nrows, nrows, nnz_sp, Lvalues, Lcolptr, Lrowind); + KokkosSparse::spmv("T", ONE, mtxL, X, ZERO, B); + } } - } - { - // unit-test for supernode SpTrsv (default) - // > set up supernodes (block size = one) - size_type nsupers = 4; - Kokkos::View supercols("supercols", - 1 + nsupers); - supercols(0) = 0; - supercols(1) = 2; // two columns - supercols(2) = 3; // one column - supercols(3) = 4; // one column - supercols(4) = 5; // one column - int *etree = NULL; // we generate graph internally - - // invert diagonal blocks - bool invert_diag = true; - khL.set_sptrsv_invert_diagonal(invert_diag); - khU.set_sptrsv_invert_diagonal(invert_diag); - - // > symbolic (on host) - sptrsv_supernodal_symbolic(nsupers, supercols.data(), etree, L.graph, - &khL, U.graph, &khU); - // > numeric (on host) - sptrsv_compute(&khL, L); - sptrsv_compute(&khU, U); - Kokkos::fence(); - - // > solve - ValuesType b("b", nrows); - Kokkos::deep_copy(b, B); - Kokkos::deep_copy(X, ZERO); - sptrsv_solve(&khL, &khU, X, b); - Kokkos::fence(); - - // > check - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, X.extent(0)), - ReductionCheck(X), sum); - if (sum != lhs.extent(0)) { - std::cout << "Supernode Tri Solve FAILURE : " << sum << " vs." 
- << lhs.extent(0) << std::endl; - khL.get_sptrsv_handle()->print_algorithm(); - } else { - std::cout << "Supernode Tri Solve SUCCESS" << std::endl; - khL.get_sptrsv_handle()->print_algorithm(); + { + // unit-test for supernode SpTrsv (default) + // > set up supernodes (block size = one) + size_type nsupers = 4; + Kokkos::View supercols("supercols", + 1 + nsupers); + supercols(0) = 0; + supercols(1) = 2; // two columns + supercols(2) = 3; // one column + supercols(3) = 4; // one column + supercols(4) = 5; // one column + int *etree = NULL; // we generate graph internally + + // invert diagonal blocks + bool invert_diag = true; + khL.set_sptrsv_invert_diagonal(invert_diag); + khU.set_sptrsv_invert_diagonal(invert_diag); + + // > symbolic (on host) + sptrsv_supernodal_symbolic(nsupers, supercols.data(), etree, L.graph, + &khL, U.graph, &khU); + // > numeric (on host) + sptrsv_compute(&khL, L); + sptrsv_compute(&khU, U); + Kokkos::fence(); + + // > solve + ValuesType b("b", nrows); + Kokkos::deep_copy(b, B); + Kokkos::deep_copy(X, ZERO); + sptrsv_solve(&khL, &khU, X, b); + Kokkos::fence(); + + // > check + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, X.extent(0)), + ReductionCheck(X), sum); + EXPECT_EQ(sum, lhs.extent(0)); + EXPECT_EQ(sum, X.extent(0)); + + khL.destroy_sptrsv_handle(); + khU.destroy_sptrsv_handle(); } - EXPECT_TRUE(sum == scalar_t(X.extent(0))); - khL.destroy_sptrsv_handle(); - khU.destroy_sptrsv_handle(); - } - - { - // unit-test for supernode SpTrsv (running TRMM on device for compute) - // > set up supernodes - size_type nsupers = 4; - Kokkos::View supercols("supercols", - 1 + nsupers); - supercols(0) = 0; - supercols(1) = 2; // two columns - supercols(2) = 3; // one column - supercols(3) = 4; // one column - supercols(4) = 5; // one column - int *etree = NULL; // we generate tree internally - - // > create handles - KernelHandle khLd; - KernelHandle khUd; - khLd.create_sptrsv_handle(SPTRSVAlgorithm::SUPERNODAL_DAG, nrows, true); 
- khUd.create_sptrsv_handle(SPTRSVAlgorithm::SUPERNODAL_DAG, nrows, false); - - // > invert diagonal blocks - bool invert_diag = true; - khLd.set_sptrsv_invert_diagonal(invert_diag); - khUd.set_sptrsv_invert_diagonal(invert_diag); - - // > invert off-diagonal blocks - bool invert_offdiag = true; - khUd.set_sptrsv_column_major(true); - khLd.set_sptrsv_invert_offdiagonal(invert_offdiag); - khUd.set_sptrsv_invert_offdiagonal(invert_offdiag); - - // > forcing sptrsv compute to perform TRMM on device - khLd.set_sptrsv_diag_supernode_sizes(1, 1); - khUd.set_sptrsv_diag_supernode_sizes(1, 1); - - // > symbolic (on host) - sptrsv_supernodal_symbolic(nsupers, supercols.data(), etree, L.graph, - &khLd, Ut.graph, &khUd); - // > numeric (on host) - sptrsv_compute(&khLd, L); - sptrsv_compute(&khUd, Ut); - Kokkos::fence(); - - // > solve - ValuesType b("b", nrows); - Kokkos::deep_copy(b, B); - Kokkos::deep_copy(X, ZERO); - sptrsv_solve(&khLd, &khUd, X, b); - Kokkos::fence(); - - // > check - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy(0, X.extent(0)), - ReductionCheck(X), sum); - if (sum != lhs.extent(0)) { - std::cout << "Supernode Tri Solve FAILURE : " << sum << " vs." 
- << lhs.extent(0) << std::endl; - khLd.get_sptrsv_handle()->print_algorithm(); - } else { - std::cout << "Supernode Tri Solve SUCCESS" << std::endl; - khLd.get_sptrsv_handle()->print_algorithm(); + { + // unit-test for supernode SpTrsv (running TRMM on device for compute) + // > set up supernodes + size_type nsupers = 4; + Kokkos::View supercols("supercols", + 1 + nsupers); + supercols(0) = 0; + supercols(1) = 2; // two columns + supercols(2) = 3; // one column + supercols(3) = 4; // one column + supercols(4) = 5; // one column + int *etree = NULL; // we generate tree internally + + // > create handles + KernelHandle khLd; + KernelHandle khUd; + khLd.create_sptrsv_handle(SPTRSVAlgorithm::SUPERNODAL_DAG, nrows, true); + khUd.create_sptrsv_handle(SPTRSVAlgorithm::SUPERNODAL_DAG, nrows, + false); + + // > invert diagonal blocks + bool invert_diag = true; + khLd.set_sptrsv_invert_diagonal(invert_diag); + khUd.set_sptrsv_invert_diagonal(invert_diag); + + // > invert off-diagonal blocks + bool invert_offdiag = true; + khUd.set_sptrsv_column_major(true); + khLd.set_sptrsv_invert_offdiagonal(invert_offdiag); + khUd.set_sptrsv_invert_offdiagonal(invert_offdiag); + + // > forcing sptrsv compute to perform TRMM on device + khLd.set_sptrsv_diag_supernode_sizes(1, 1); + khUd.set_sptrsv_diag_supernode_sizes(1, 1); + + // > symbolic (on host) + sptrsv_supernodal_symbolic(nsupers, supercols.data(), etree, L.graph, + &khLd, Ut.graph, &khUd); + // > numeric (on host) + sptrsv_compute(&khLd, L); + sptrsv_compute(&khUd, Ut); + Kokkos::fence(); + + // > solve + ValuesType b("b", nrows); + Kokkos::deep_copy(b, B); + Kokkos::deep_copy(X, ZERO); + sptrsv_solve(&khLd, &khUd, X, b); + Kokkos::fence(); + + // > check + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, X.extent(0)), + ReductionCheck(X), sum); + EXPECT_EQ(sum, lhs.extent(0)); + EXPECT_EQ(sum, X.extent(0)); + + khLd.destroy_sptrsv_handle(); + khUd.destroy_sptrsv_handle(); } - EXPECT_TRUE(sum == 
scalar_t(X.extent(0))); - - khLd.destroy_sptrsv_handle(); - khUd.destroy_sptrsv_handle(); - } #endif + } } -} - -template -void run_test_sptrsv_streams(int test_algo, int nstreams) { - using RowMapType = Kokkos::View; - using EntriesType = Kokkos::View; - using ValuesType = Kokkos::View; - using RowMapType_hostmirror = typename RowMapType::HostMirror; - using EntriesType_hostmirror = typename EntriesType::HostMirror; - using ValuesType_hostmirror = typename ValuesType::HostMirror; - using execution_space = typename device::execution_space; - using memory_space = typename device::memory_space; - using KernelHandle = KokkosKernels::Experimental::KokkosKernelsHandle< - size_type, lno_t, scalar_t, execution_space, memory_space, memory_space>; - using crsMat_t = CrsMatrix; - // Workaround for OpenMP: skip tests if concurrency < nstreams because of - // not enough resource to partition - bool run_streams_test = true; + static void run_test_sptrsv_streams(int test_algo, int nstreams) { + // Workaround for OpenMP: skip tests if concurrency < nstreams because of + // not enough resource to partition + bool run_streams_test = true; #ifdef KOKKOS_ENABLE_OPENMP - if (std::is_same::value) { - int exec_concurrency = execution_space().concurrency(); - if (exec_concurrency < nstreams) { - run_streams_test = false; - std::cout << " Skip stream test: concurrency = " << exec_concurrency - << std::endl; + if (std::is_same::value) { + int exec_concurrency = execution_space().concurrency(); + if (exec_concurrency < nstreams) { + run_streams_test = false; + std::cout << " Skip stream test: concurrency = " << exec_concurrency + << std::endl; + } } - } #endif - if (!run_streams_test) return; - - scalar_t ZERO = scalar_t(0); - scalar_t ONE = scalar_t(1); - - const size_type nrows = 5; - const size_type nnz = 10; - - std::vector instances; - if (nstreams == 1) - instances = Kokkos::Experimental::partition_space(execution_space(), 1); - else if (nstreams == 2) - instances = 
Kokkos::Experimental::partition_space(execution_space(), 1, 1); - else if (nstreams == 3) - instances = - Kokkos::Experimental::partition_space(execution_space(), 1, 1, 1); - else // (nstreams == 4) - instances = - Kokkos::Experimental::partition_space(execution_space(), 1, 1, 1, 1); - - std::vector kh_v(nstreams); - std::vector kh_ptr_v(nstreams); - std::vector row_map_v(nstreams); - std::vector entries_v(nstreams); - std::vector values_v(nstreams); - std::vector rhs_v(nstreams); - std::vector lhs_v(nstreams); - - RowMapType_hostmirror hrow_map("hrow_map", nrows + 1); - EntriesType_hostmirror hentries("hentries", nnz); - ValuesType_hostmirror hvalues("hvalues", nnz); - - // Upper tri - { - hrow_map(0) = 0; - hrow_map(1) = 2; - hrow_map(2) = 4; - hrow_map(3) = 7; - hrow_map(4) = 9; - hrow_map(5) = 10; - - hentries(0) = 0; - hentries(1) = 2; - hentries(2) = 1; - hentries(3) = 4; - hentries(4) = 2; - hentries(5) = 3; - hentries(6) = 4; - hentries(7) = 3; - hentries(8) = 4; - hentries(9) = 4; - - for (size_type i = 0; i < nnz; ++i) { - hvalues(i) = ONE; - } + if (!run_streams_test) return; - for (int i = 0; i < nstreams; i++) { - // Allocate U - row_map_v[i] = RowMapType("row_map", nrows + 1); - entries_v[i] = EntriesType("entries", nnz); - values_v[i] = ValuesType("values", nnz); + scalar_t ZERO = scalar_t(0); + scalar_t ONE = scalar_t(1); - // Copy from host to device - Kokkos::deep_copy(row_map_v[i], hrow_map); - Kokkos::deep_copy(entries_v[i], hentries); - Kokkos::deep_copy(values_v[i], hvalues); + const size_type nrows = 5; + const size_type nnz = 10; - // Create known_lhs, generate rhs, then solve for lhs to compare to - // known_lhs - ValuesType known_lhs("known_lhs", nrows); - // Create known solution lhs set to all 1's - Kokkos::deep_copy(known_lhs, ONE); + auto instances = Kokkos::Experimental::partition_space( + execution_space(), std::vector(nstreams, 1)); - // Solution to find - lhs_v[i] = ValuesType("lhs", nrows); + std::vector kh_v(nstreams); + 
std::vector kh_ptr_v(nstreams); + std::vector row_map_v(nstreams); + std::vector entries_v(nstreams); + std::vector values_v(nstreams); + std::vector rhs_v(nstreams); + std::vector lhs_v(nstreams); - // A*known_lhs generates rhs: rhs is dense, use spmv - rhs_v[i] = ValuesType("rhs", nrows); - - crsMat_t triMtx("triMtx", nrows, nrows, nnz, values_v[i], row_map_v[i], - entries_v[i]); - - KokkosSparse::spmv("N", ONE, triMtx, known_lhs, ZERO, rhs_v[i]); - Kokkos::fence(); - - // Create handle - kh_v[i] = KernelHandle(); - bool is_lower_tri = false; - if (test_algo == 0) - kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_RP, nrows, - is_lower_tri); - else if (test_algo == 1) - kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, - is_lower_tri); - else - kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SPTRSV_CUSPARSE, nrows, - is_lower_tri); - - kh_ptr_v[i] = &kh_v[i]; - - // Symbolic phase - sptrsv_symbolic(kh_ptr_v[i], row_map_v[i], entries_v[i], values_v[i]); - Kokkos::fence(); - } // Done handle creation and sptrsv_symbolic on all streams - - // Solve phase - sptrsv_solve_streams(instances, kh_ptr_v, row_map_v, entries_v, values_v, - rhs_v, lhs_v); - - for (int i = 0; i < nstreams; i++) instances[i].fence(); - - // Checking - for (int i = 0; i < nstreams; i++) { - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy( - 0, lhs_v[i].extent(0)), - ReductionCheck(lhs_v[i]), sum); - if (sum != lhs_v[i].extent(0)) { - std::cout << "Upper Tri Solve FAILURE on stream " << i << std::endl; - kh_v[i].get_sptrsv_handle()->print_algorithm(); - } - EXPECT_TRUE(sum == scalar_t(lhs_v[i].extent(0))); + RowMapType_hostmirror hrow_map; + EntriesType_hostmirror hentries; + ValuesType_hostmirror hvalues; - kh_v[i].destroy_sptrsv_handle(); - } - } - - // Lower tri - { - hrow_map(0) = 0; - hrow_map(1) = 1; - hrow_map(2) = 2; - hrow_map(3) = 4; - hrow_map(4) = 6; - hrow_map(5) = 10; - - hentries(0) = 0; - hentries(1) = 1; - hentries(2) = 0; - 
hentries(3) = 2; - hentries(4) = 2; - hentries(5) = 3; - hentries(6) = 1; - hentries(7) = 2; - hentries(8) = 3; - hentries(9) = 4; - - for (size_type i = 0; i < nnz; ++i) { - hvalues(i) = ONE; + // Upper tri + { + auto fixture = get_5x5_ut_ones_fixture(); + compress_matrix(hrow_map, hentries, hvalues, fixture); + + for (int i = 0; i < nstreams; i++) { + // Allocate U + row_map_v[i] = RowMapType("row_map", nrows + 1); + entries_v[i] = EntriesType("entries", nnz); + values_v[i] = ValuesType("values", nnz); + + // Copy from host to device + Kokkos::deep_copy(row_map_v[i], hrow_map); + Kokkos::deep_copy(entries_v[i], hentries); + Kokkos::deep_copy(values_v[i], hvalues); + + // Create known_lhs, generate rhs, then solve for lhs to compare to + // known_lhs + ValuesType known_lhs("known_lhs", nrows); + // Create known solution lhs set to all 1's + Kokkos::deep_copy(known_lhs, ONE); + + // Solution to find + lhs_v[i] = ValuesType("lhs", nrows); + + // A*known_lhs generates rhs: rhs is dense, use spmv + rhs_v[i] = ValuesType("rhs", nrows); + + Crs triMtx("triMtx", nrows, nrows, nnz, values_v[i], row_map_v[i], + entries_v[i]); + + KokkosSparse::spmv("N", ONE, triMtx, known_lhs, ZERO, rhs_v[i]); + Kokkos::fence(); + + // Create handle + kh_v[i] = KernelHandle(); + bool is_lower_tri = false; + if (test_algo == 0) + kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_RP, nrows, + is_lower_tri); + else if (test_algo == 1) + kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, + is_lower_tri); + else + kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SPTRSV_CUSPARSE, nrows, + is_lower_tri); + + kh_ptr_v[i] = &kh_v[i]; + + // Symbolic phase + sptrsv_symbolic(kh_ptr_v[i], row_map_v[i], entries_v[i], values_v[i]); + Kokkos::fence(); + } // Done handle creation and sptrsv_symbolic on all streams + + // Solve phase + sptrsv_solve_streams(instances, kh_ptr_v, row_map_v, entries_v, values_v, + rhs_v, lhs_v); + + for (int i = 0; i < nstreams; i++) instances[i].fence(); 
+ + // Checking + for (int i = 0; i < nstreams; i++) { + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs_v[i].extent(0)), + ReductionCheck(lhs_v[i]), sum); + EXPECT_EQ(sum, lhs_v[i].extent(0)); + + kh_v[i].destroy_sptrsv_handle(); + } } - for (int i = 0; i < nstreams; i++) { - // Allocate L - row_map_v[i] = RowMapType("row_map", nrows + 1); - entries_v[i] = EntriesType("entries", nnz); - values_v[i] = ValuesType("values", nnz); - - // Copy from host to device - Kokkos::deep_copy(row_map_v[i], hrow_map); - Kokkos::deep_copy(entries_v[i], hentries); - Kokkos::deep_copy(values_v[i], hvalues); - - // Create known_lhs, generate rhs, then solve for lhs to compare to - // known_lhs - ValuesType known_lhs("known_lhs", nrows); - // Create known solution lhs set to all 1's - Kokkos::deep_copy(known_lhs, ONE); - - // Solution to find - lhs_v[i] = ValuesType("lhs", nrows); - - // A*known_lhs generates rhs: rhs is dense, use spmv - rhs_v[i] = ValuesType("rhs", nrows); - - crsMat_t triMtx("triMtx", nrows, nrows, nnz, values_v[i], row_map_v[i], - entries_v[i]); - - KokkosSparse::spmv("N", ONE, triMtx, known_lhs, ZERO, rhs_v[i]); - Kokkos::fence(); - - // Create handle - kh_v[i] = KernelHandle(); - bool is_lower_tri = true; - if (test_algo == 0) - kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_RP, nrows, - is_lower_tri); - else if (test_algo == 1) - kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, - is_lower_tri); - else - kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SPTRSV_CUSPARSE, nrows, - is_lower_tri); - - kh_ptr_v[i] = &kh_v[i]; - - // Symbolic phase - sptrsv_symbolic(kh_ptr_v[i], row_map_v[i], entries_v[i], values_v[i]); - Kokkos::fence(); - } // Done handle creation and sptrsv_symbolic on all streams - - // Solve phase - sptrsv_solve_streams(instances, kh_ptr_v, row_map_v, entries_v, values_v, - rhs_v, lhs_v); - - for (int i = 0; i < nstreams; i++) instances[i].fence(); - - // Checking - for (int i = 0; i < nstreams; i++) 
{ - scalar_t sum = 0.0; - Kokkos::parallel_reduce( - Kokkos::RangePolicy( - 0, lhs_v[i].extent(0)), - ReductionCheck(lhs_v[i]), sum); - if (sum != lhs_v[i].extent(0)) { - std::cout << "Lower Tri Solve FAILURE on stream " << i << std::endl; - kh_v[i].get_sptrsv_handle()->print_algorithm(); + // Lower tri + { + auto fixture = get_5x5_lt_ones_fixture(); + compress_matrix(hrow_map, hentries, hvalues, fixture); + + for (int i = 0; i < nstreams; i++) { + // Allocate L + row_map_v[i] = RowMapType("row_map", nrows + 1); + entries_v[i] = EntriesType("entries", nnz); + values_v[i] = ValuesType("values", nnz); + + // Copy from host to device + Kokkos::deep_copy(row_map_v[i], hrow_map); + Kokkos::deep_copy(entries_v[i], hentries); + Kokkos::deep_copy(values_v[i], hvalues); + + // Create known_lhs, generate rhs, then solve for lhs to compare to + // known_lhs + ValuesType known_lhs("known_lhs", nrows); + // Create known solution lhs set to all 1's + Kokkos::deep_copy(known_lhs, ONE); + + // Solution to find + lhs_v[i] = ValuesType("lhs", nrows); + + // A*known_lhs generates rhs: rhs is dense, use spmv + rhs_v[i] = ValuesType("rhs", nrows); + + Crs triMtx("triMtx", nrows, nrows, nnz, values_v[i], row_map_v[i], + entries_v[i]); + + KokkosSparse::spmv("N", ONE, triMtx, known_lhs, ZERO, rhs_v[i]); + Kokkos::fence(); + + // Create handle + kh_v[i] = KernelHandle(); + bool is_lower_tri = true; + if (test_algo == 0) + kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_RP, nrows, + is_lower_tri); + else if (test_algo == 1) + kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SEQLVLSCHD_TP1, nrows, + is_lower_tri); + else + kh_v[i].create_sptrsv_handle(SPTRSVAlgorithm::SPTRSV_CUSPARSE, nrows, + is_lower_tri); + + kh_ptr_v[i] = &kh_v[i]; + + // Symbolic phase + sptrsv_symbolic(kh_ptr_v[i], row_map_v[i], entries_v[i], values_v[i]); + Kokkos::fence(); + } // Done handle creation and sptrsv_symbolic on all streams + + // Solve phase + sptrsv_solve_streams(instances, kh_ptr_v, row_map_v, 
entries_v, values_v, + rhs_v, lhs_v); + + for (int i = 0; i < nstreams; i++) instances[i].fence(); + + // Checking + for (int i = 0; i < nstreams; i++) { + scalar_t sum = 0.0; + Kokkos::parallel_reduce(range_policy_t(0, lhs_v[i].extent(0)), + ReductionCheck(lhs_v[i]), sum); + EXPECT_EQ(sum, lhs_v[i].extent(0)); + + kh_v[i].destroy_sptrsv_handle(); } - EXPECT_TRUE(sum == scalar_t(lhs_v[i].extent(0))); - - kh_v[i].destroy_sptrsv_handle(); } } -} +}; } // namespace Test template void test_sptrsv() { - Test::run_test_sptrsv(); - // Test::run_test_sptrsv_mtx(); + using TestStruct = Test::SptrsvTest; + TestStruct::run_test_sptrsv(); } template void test_sptrsv_streams() { - std::cout << "SPTRSVAlgorithm::SEQLVLSCHD_RP: 1 stream" << std::endl; - Test::run_test_sptrsv_streams(0, 1); - - std::cout << "SPTRSVAlgorithm::SEQLVLSCHD_RP: 2 streams" << std::endl; - Test::run_test_sptrsv_streams(0, 2); - - std::cout << "SPTRSVAlgorithm::SEQLVLSCHD_RP: 3 streams" << std::endl; - Test::run_test_sptrsv_streams(0, 3); - - std::cout << "SPTRSVAlgorithm::SEQLVLSCHD_RP: 4 streams" << std::endl; - Test::run_test_sptrsv_streams(0, 4); - - std::cout << "SPTRSVAlgorithm::SEQLVLSCHD_TP1: 1 stream" << std::endl; - Test::run_test_sptrsv_streams(1, 1); + using TestStruct = Test::SptrsvTest; - std::cout << "SPTRSVAlgorithm::SEQLVLSCHD_TP1: 2 streams" << std::endl; - Test::run_test_sptrsv_streams(1, 2); - - std::cout << "SPTRSVAlgorithm::SEQLVLSCHD_TP1: 3 streams" << std::endl; - Test::run_test_sptrsv_streams(1, 3); - - std::cout << "SPTRSVAlgorithm::SEQLVLSCHD_TP1: 4 streams" << std::endl; - Test::run_test_sptrsv_streams(1, 4); + TestStruct::run_test_sptrsv_streams(0, 1); + TestStruct::run_test_sptrsv_streams(0, 2); + TestStruct::run_test_sptrsv_streams(0, 3); + TestStruct::run_test_sptrsv_streams(0, 4); + TestStruct::run_test_sptrsv_streams(1, 1); + TestStruct::run_test_sptrsv_streams(1, 2); + TestStruct::run_test_sptrsv_streams(1, 3); + TestStruct::run_test_sptrsv_streams(1, 4); #if 
defined(KOKKOS_ENABLE_CUDA) && defined(KOKKOSKERNELS_ENABLE_TPL_CUSPARSE) if (std::is_same::value && std::is_same::value) { - std::cout << "SPTRSVAlgorithm::SPTRSV_CUSPARSE: 1 stream" << std::endl; - Test::run_test_sptrsv_streams(2, 1); - - std::cout << "SPTRSVAlgorithm::SPTRSV_CUSPARSE: 2 streams" << std::endl; - Test::run_test_sptrsv_streams(2, 2); - - std::cout << "SPTRSVAlgorithm::SPTRSV_CUSPARSE: 3 streams" << std::endl; - Test::run_test_sptrsv_streams(2, 3); - - std::cout << "SPTRSVAlgorithm::SPTRSV_CUSPARSE: 4 streams" << std::endl; - Test::run_test_sptrsv_streams(2, 4); + TestStruct::run_test_sptrsv_streams(2, 1); + TestStruct::run_test_sptrsv_streams(2, 2); + TestStruct::run_test_sptrsv_streams(2, 3); + TestStruct::run_test_sptrsv_streams(2, 4); } #endif } diff --git a/sparse/unit_test/Test_Sparse_trsv.hpp b/sparse/unit_test/Test_Sparse_trsv.hpp index d580cc472d..8fb4763d71 100644 --- a/sparse/unit_test/Test_Sparse_trsv.hpp +++ b/sparse/unit_test/Test_Sparse_trsv.hpp @@ -34,89 +34,131 @@ typedef Kokkos::complex kokkos_complex_double; typedef Kokkos::complex kokkos_complex_float; namespace Test { -// TODO: remove this once MD develop branch is merge. -// The below functionolity exists in SparseUtils. - -template -void check_trsv_mv(crsMat_t input_mat, x_vector_type x, y_vector_type b, - y_vector_type expected_x, int numMV, const char uplo[], - const char trans[]) { - // typedef typename crsMat_t::StaticCrsGraphType graph_t; - typedef typename crsMat_t::values_type::non_const_type scalar_view_t; - typedef typename scalar_view_t::value_type ScalarA; - double eps = (std::is_same::value - ? 2 * 1e-2 - : (std::is_same>::value || - std::is_same>::value) - ? 
2 * 1e-1 - : 1e-7); - - Kokkos::fence(); - KokkosSparse::trsv(uplo, trans, "N", input_mat, b, x); - - for (int i = 0; i < numMV; ++i) { - auto x_i = Kokkos::subview(x, Kokkos::ALL(), i); - - auto expected_x_i = Kokkos::subview(expected_x, Kokkos::ALL(), i); - - EXPECT_NEAR_KK_1DVIEW(expected_x_i, x_i, eps); - } + +template < + typename Crs, typename LUType, typename size_type, + typename std::enable_if::value>::type* = nullptr> +LUType get_LU(char l_or_u, int n, size_type& nnz, int row_size_variance, + int bandwidth, int) { + auto LU = KokkosSparse::Impl::kk_generate_triangular_sparse_matrix( + l_or_u, n, n, nnz, row_size_variance, bandwidth); + + return LU; +} + +template < + typename Crs, typename LUType, typename size_type, + typename std::enable_if::value>::type* = nullptr> +LUType get_LU(char l_or_u, int n, size_type& nnz, int row_size_variance, + int bandwidth, int block_size) { + auto LU_unblocked = + KokkosSparse::Impl::kk_generate_triangular_sparse_matrix( + l_or_u, n, n, nnz, row_size_variance, bandwidth); + + // Convert to BSR + LUType LU(LU_unblocked, block_size); + + return LU; } -} // namespace Test template -void test_trsv_mv(lno_t numRows, size_type nnz, lno_t bandwidth, - lno_t row_size_variance, int numMV) { - lno_t numCols = numRows; + typename layout, typename device> +struct TrsvTest { + using View2D = Kokkos::View; + using execution_space = typename device::execution_space; + + using Crs = CrsMatrix; + using Bsr = BsrMatrix; + + // TODO: remove this once MD develop branch is merged. + // The below functionality exists in SparseUtils. + template + static void check_trsv_mv(sp_matrix_type input_mat, View2D x, View2D b, + View2D expected_x, int numMV, const char uplo[], + const char trans[]) { + double eps = (std::is_same::value + ? 2 * 1e-2 + : (std::is_same>::value || + std::is_same>::value) + ? 
2 * 1e-1 + : 1e-7); + + Kokkos::fence(); + KokkosSparse::trsv(uplo, trans, "N", input_mat, b, x); + + for (int i = 0; i < numMV; ++i) { + auto x_i = Kokkos::subview(x, Kokkos::ALL(), i); + + auto expected_x_i = Kokkos::subview(expected_x, Kokkos::ALL(), i); + + EXPECT_NEAR_KK_1DVIEW(expected_x_i, x_i, eps); + } + } + + template + static void test_trsv_mv(lno_t numRows, size_type nnz, lno_t bandwidth, + lno_t row_size_variance, int numMV) { + using sp_matrix_type = std::conditional_t; - typedef - typename KokkosSparse::CrsMatrix - crsMat_t; - // typedef typename crsMat_t::values_type::non_const_type scalar_view_t; + constexpr auto block_size = UseBlocks ? 10 : 1; - typedef Kokkos::View ViewTypeX; - typedef Kokkos::View ViewTypeY; + lno_t numCols = numRows; - ViewTypeX b_x("A", numRows, numMV); - ViewTypeY b_y("B", numCols, numMV); - ViewTypeX b_x_copy("B", numCols, numMV); + View2D b_x("A", numRows, numMV); + View2D b_y("B", numCols, numMV); + View2D b_x_copy("B", numCols, numMV); - Kokkos::Random_XorShift64_Pool rand_pool( - 13718); - Kokkos::fill_random(b_x_copy, rand_pool, scalar_t(10)); + Kokkos::Random_XorShift64_Pool rand_pool(13718); + Kokkos::fill_random(b_x_copy, rand_pool, scalar_t(10)); - typename ViewTypeY::non_const_value_type alpha = 1; - typename ViewTypeY::non_const_value_type beta = 0; + scalar_t alpha = 1; + scalar_t beta = 0; - // this function creates a dense lower and upper triangular matrix. - // TODO: SHOULD CHANGE IT TO SPARSE - crsMat_t lower_part = - KokkosSparse::Impl::kk_generate_triangular_sparse_matrix( - 'L', numRows, numCols, nnz, row_size_variance, bandwidth); + // this function creates a dense lower and upper triangular matrix. 
+ auto lower_part = get_LU( + 'L', numRows, nnz, row_size_variance, bandwidth, block_size); - Test::shuffleMatrixEntries(lower_part.graph.row_map, lower_part.graph.entries, - lower_part.values); + Test::shuffleMatrixEntries(lower_part.graph.row_map, + lower_part.graph.entries, lower_part.values, + block_size); - KokkosSparse::spmv("N", alpha, lower_part, b_x_copy, beta, b_y); - Test::check_trsv_mv(lower_part, b_x, b_y, b_x_copy, numMV, "L", "N"); + KokkosSparse::spmv("N", alpha, lower_part, b_x_copy, beta, b_y); + check_trsv_mv(lower_part, b_x, b_y, b_x_copy, numMV, "L", "N"); - KokkosSparse::spmv("T", alpha, lower_part, b_x_copy, beta, b_y); - Test::check_trsv_mv(lower_part, b_x, b_y, b_x_copy, numMV, "L", "T"); - // typedef typename Kokkos::View indexview; + if (!UseBlocks) { + KokkosSparse::spmv("T", alpha, lower_part, b_x_copy, beta, b_y); + check_trsv_mv(lower_part, b_x, b_y, b_x_copy, numMV, "L", "T"); + } - crsMat_t upper_part = - KokkosSparse::Impl::kk_generate_triangular_sparse_matrix( - 'U', numRows, numCols, nnz, row_size_variance, bandwidth); + auto upper_part = get_LU( + 'U', numRows, nnz, row_size_variance, bandwidth, block_size); - Test::shuffleMatrixEntries(upper_part.graph.row_map, upper_part.graph.entries, - upper_part.values); + Test::shuffleMatrixEntries(upper_part.graph.row_map, + upper_part.graph.entries, upper_part.values, + block_size); - KokkosSparse::spmv("N", alpha, upper_part, b_x_copy, beta, b_y); - Test::check_trsv_mv(upper_part, b_x, b_y, b_x_copy, numMV, "U", "N"); + KokkosSparse::spmv("N", alpha, upper_part, b_x_copy, beta, b_y); + check_trsv_mv(upper_part, b_x, b_y, b_x_copy, numMV, "U", "N"); + + if (!UseBlocks) { + KokkosSparse::spmv("T", alpha, upper_part, b_x_copy, beta, b_y); + check_trsv_mv(upper_part, b_x, b_y, b_x_copy, numMV, "U", "T"); + } + } +}; - KokkosSparse::spmv("T", alpha, upper_part, b_x_copy, beta, b_y); - Test::check_trsv_mv(upper_part, b_x, b_y, b_x_copy, numMV, "U", "T"); +} // namespace Test + +template +void 
test_trsv_mv() { + using TestStruct = Test::TrsvTest; + TestStruct::template test_trsv_mv(1000, 1000 * 30, 200, 10, 1); + TestStruct::template test_trsv_mv(800, 800 * 30, 100, 10, 5); + TestStruct::template test_trsv_mv(400, 400 * 20, 100, 5, 10); + TestStruct::template test_trsv_mv(1000, 1000 * 30, 200, 10, 1); + TestStruct::template test_trsv_mv(800, 800 * 30, 100, 10, 5); + TestStruct::template test_trsv_mv(400, 400 * 20, 100, 5, 10); } // Note BMK 7-22: the matrix generator used by this test always @@ -126,12 +168,7 @@ void test_trsv_mv(lno_t numRows, size_type nnz, lno_t bandwidth, TEST_F( \ TestCategory, \ sparse##_##trsv_mv##_##SCALAR##_##ORDINAL##_##OFFSET##_##LAYOUT##_##DEVICE) { \ - test_trsv_mv( \ - 1000, 1000 * 30, 200, 10, 1); \ - test_trsv_mv( \ - 800, 800 * 30, 100, 10, 5); \ - test_trsv_mv( \ - 400, 400 * 20, 100, 5, 10); \ + test_trsv_mv(); \ } #if defined(KOKKOSKERNELS_INST_LAYOUTLEFT) || \ diff --git a/sparse/unit_test/Test_vector_fixtures.hpp b/sparse/unit_test/Test_vector_fixtures.hpp new file mode 100644 index 0000000000..2037a5485e --- /dev/null +++ b/sparse/unit_test/Test_vector_fixtures.hpp @@ -0,0 +1,212 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 4.0 +// Copyright (2022) National Technology & Engineering +// Solutions of Sandia, LLC (NTESS). +// +// Under the terms of Contract DE-NA0003525 with NTESS, +// the U.S. Government retains certain rights in this software. +// +// Part of Kokkos, under the Apache License v2.0 with LLVM Exceptions. +// See https://kokkos.org/LICENSE for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//@HEADER + +#ifndef _TEST_VECTOR_FIXTURES_HPP +#define _TEST_VECTOR_FIXTURES_HPP + +#include + +#include + +/** + * API for working with 2D vectors of small matrices for testing. 
+ */ + +namespace Test { + +template +scalar_t KEEP_ZERO() { + return scalar_t(-9999.0); +} + +template +void compress_matrix( + MapT& map, EntriesT& entries, ValuesT& values, + const std::vector>& + fixture) { + using size_type = typename MapT::non_const_value_type; + using scalar_t = typename ValuesT::non_const_value_type; + + const scalar_t ZERO = scalar_t(0); + + const size_type nrows = fixture.size(); + const size_type ncols = fixture[0].size(); + + // Count fixture nnz's + size_type nnz = 0; + for (size_type row_idx = 0; row_idx < nrows; ++row_idx) { + for (size_type col_idx = 0; col_idx < ncols; ++col_idx) { + if (fixture[row_idx][col_idx] != ZERO) { + ++nnz; + } + } + } + + // Allocate device CRS views + Kokkos::resize(map, (CSC ? ncols : nrows) + 1); + Kokkos::resize(entries, nnz); + Kokkos::resize(values, nnz); + + // Create host mirror views for CRS + auto hmap = Kokkos::create_mirror_view(map); + auto hentries = Kokkos::create_mirror_view(entries); + auto hvalues = Kokkos::create_mirror_view(values); + + // Compress into CRS (host views) + size_type curr_nnz = 0; + + const size_type num_outer = (CSC ? ncols : nrows); + const size_type num_inner = (CSC ? nrows : ncols); + for (size_type outer_idx = 0; outer_idx < num_outer; ++outer_idx) { + for (size_type inner_idx = 0; inner_idx < num_inner; ++inner_idx) { + const size_type row = CSC ? inner_idx : outer_idx; + const size_type col = CSC ? outer_idx : inner_idx; + const auto val = fixture[row][col]; + if (val != ZERO) { + hentries(curr_nnz) = inner_idx; + hvalues(curr_nnz) = val == KEEP_ZERO() ? 
ZERO : val; + ++curr_nnz; + } + hmap(outer_idx + 1) = curr_nnz; + } + } + + // Copy host CRS views to device CRS views + Kokkos::deep_copy(map, hmap); + Kokkos::deep_copy(entries, hentries); + Kokkos::deep_copy(values, hvalues); +} + +template +std::vector> +decompress_matrix(const RowMapT& row_map, const EntriesT& entries, + const ValuesT& values) { + using size_type = typename RowMapT::non_const_value_type; + using scalar_t = typename ValuesT::non_const_value_type; + + const scalar_t ZERO = scalar_t(0); + + const size_type nrows = row_map.size() - 1; + std::vector> result; + result.resize(nrows); + for (auto& row : result) { + row.resize(nrows, ZERO); + } + + auto hrow_map = Kokkos::create_mirror_view(row_map); + auto hentries = Kokkos::create_mirror_view(entries); + auto hvalues = Kokkos::create_mirror_view(values); + Kokkos::deep_copy(hrow_map, row_map); + Kokkos::deep_copy(hentries, entries); + Kokkos::deep_copy(hvalues, values); + + for (size_type row_idx = 0; row_idx < nrows; ++row_idx) { + const size_type row_nnz_begin = hrow_map(row_idx); + const size_type row_nnz_end = hrow_map(row_idx + 1); + for (size_type row_nnz = row_nnz_begin; row_nnz < row_nnz_end; ++row_nnz) { + const auto col_idx = hentries(row_nnz); + const scalar_t value = hvalues(row_nnz); + if (CSC) { + result[col_idx][row_idx] = value; + } else { + result[row_idx][col_idx] = value; + } + } + } + + return result; +} + +template +std::vector> +decompress_matrix(const RowMapT& row_map, const EntriesT& entries, + const ValuesT& values, + typename RowMapT::const_value_type block_size) { + using size_type = typename RowMapT::non_const_value_type; + using scalar_t = typename ValuesT::non_const_value_type; + + const scalar_t ZERO = scalar_t(0); + + const size_type nbrows = row_map.extent(0) - 1; + const size_type nrows = nbrows * block_size; + const size_type block_items = block_size * block_size; + std::vector> result; + result.resize(nrows); + for (auto& row : result) { + row.resize(nrows, ZERO); 
+ } + + auto hrow_map = Kokkos::create_mirror_view(row_map); + auto hentries = Kokkos::create_mirror_view(entries); + auto hvalues = Kokkos::create_mirror_view(values); + Kokkos::deep_copy(hrow_map, row_map); + Kokkos::deep_copy(hentries, entries); + Kokkos::deep_copy(hvalues, values); + + for (size_type row_idx = 0; row_idx < nbrows; ++row_idx) { + const size_type row_nnz_begin = hrow_map(row_idx); + const size_type row_nnz_end = hrow_map(row_idx + 1); + for (size_type row_nnz = row_nnz_begin; row_nnz < row_nnz_end; ++row_nnz) { + const auto col_idx = hentries(row_nnz); + for (size_type i = 0; i < block_size; ++i) { + const size_type unc_row_idx = row_idx * block_size + i; + for (size_type j = 0; j < block_size; ++j) { + const size_type unc_col_idx = col_idx * block_size + j; + result[unc_row_idx][unc_col_idx] = + hvalues(row_nnz * block_items + i * block_size + j); + } + } + } + } + + return result; +} + +template +void check_matrix( + const std::string& name, const RowMapT& row_map, const EntriesT& entries, + const ValuesT& values, + const std::vector>& + expected) { + using size_type = typename RowMapT::non_const_value_type; + + const auto decompressed_mtx = decompress_matrix(row_map, entries, values); + + const size_type nrows = row_map.size() - 1; + for (size_type row_idx = 0; row_idx < nrows; ++row_idx) { + for (size_type col_idx = 0; col_idx < nrows; ++col_idx) { + EXPECT_NEAR(expected[row_idx][col_idx], + decompressed_mtx[row_idx][col_idx], 0.01) + << "Failed check is: " << name << "[" << row_idx << "][" << col_idx + << "]"; + } + } +} + +template +void print_matrix(const std::vector>& matrix) { + for (const auto& row : matrix) { + for (const auto& item : row) { + std::printf("%.5f ", item); + } + std::cout << std::endl; + } +} + +} // namespace Test + +#endif // _TEST_VECTOR_FIXTURES_HPP diff --git a/test_common/KokkosKernels_TestUtils.hpp b/test_common/KokkosKernels_TestUtils.hpp index 236bcdd1c8..232b66242a 100644 --- 
a/test_common/KokkosKernels_TestUtils.hpp +++ b/test_common/KokkosKernels_TestUtils.hpp @@ -776,9 +776,11 @@ class RandCsMatrix { MapViewTypeD get_map() { return __getter_copy_helper(__map_d); } }; -/// \brief Randomly shuffle the entries in each row (col) of a Crs (Ccs) matrix. +/// \brief Randomly shuffle the entries in each row (col) of a Crs (Ccs) or Bsr +/// matrix. template -void shuffleMatrixEntries(Rowptrs rowptrs, Entries entries, Values values) { +void shuffleMatrixEntries(Rowptrs rowptrs, Entries entries, Values values, + const size_t block_size = 1) { using size_type = typename Rowptrs::non_const_value_type; using ordinal_type = typename Entries::value_type; auto rowptrsHost = @@ -789,6 +791,7 @@ void shuffleMatrixEntries(Rowptrs rowptrs, Entries entries, Values values) { Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), values); ordinal_type numRows = rowptrsHost.extent(0) ? (rowptrsHost.extent(0) - 1) : 0; + const size_t block_items = block_size * block_size; for (ordinal_type i = 0; i < numRows; i++) { size_type rowBegin = rowptrsHost(i); size_type rowEnd = rowptrsHost(i + 1); @@ -796,7 +799,9 @@ void shuffleMatrixEntries(Rowptrs rowptrs, Entries entries, Values values) { ordinal_type swapRange = rowEnd - j; size_type swapOffset = j + (rand() % swapRange); std::swap(entriesHost(j), entriesHost(swapOffset)); - std::swap(valuesHost(j), valuesHost(swapOffset)); + std::swap_ranges(valuesHost.data() + j * block_items, + valuesHost.data() + (j + 1) * block_items, + valuesHost.data() + swapOffset * block_items); } } Kokkos::deep_copy(entries, entriesHost); diff --git a/test_common/Test_HIP.hpp b/test_common/Test_HIP.hpp index c9e02698c5..dfb8e1d687 100644 --- a/test_common/Test_HIP.hpp +++ b/test_common/Test_HIP.hpp @@ -31,7 +31,18 @@ class hip : public ::testing::Test { static void TearDownTestCase() {} }; +using HIPSpaceDevice = Kokkos::Device; +using HIPManagedSpaceDevice = + Kokkos::Device; + #define TestCategory hip -#define TestDevice 
Kokkos::HIP + +// Prefer for any testing where only one exec space is used +#if defined(KOKKOSKERNELS_INST_MEMSPACE_HIPMANAGEDSPACE) && \ + !defined(KOKKOSKERNELS_INST_MEMSPACE_HIPSPACE) +#define TestDevice HIPManagedSpaceDevice +#else +#define TestDevice HIPSpaceDevice +#endif #endif // TEST_HIP_HPP
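
Reviewer note on the `shuffleMatrixEntries` change in `test_common/KokkosKernels_TestUtils.hpp` above: with a Bsr matrix each column entry owns `block_size * block_size` contiguous values, so swapping two entries has to move the whole value block rather than a single scalar. The following is a minimal sketch of that idea in plain C++, assuming `std::vector` storage instead of Kokkos host views. The helper name `swap_block_entries` is hypothetical and is not part of Kokkos Kernels.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (illustration only, not a Kokkos Kernels API):
// swap entries j and k of a block-sparse row. Each entry owns
// block_size * block_size contiguous scalars in `values`.
void swap_block_entries(std::vector<int>& entries, std::vector<double>& values,
                        std::size_t j, std::size_t k, std::size_t block_size) {
  const std::size_t block_items = block_size * block_size;
  if (j == k) return;  // this sketch skips self-swaps to avoid overlapping ranges
  // Swap the block column indices.
  std::swap(entries[j], entries[k]);
  // Swap the two value blocks element by element.
  std::swap_ranges(values.data() + j * block_items,
                   values.data() + (j + 1) * block_items,
                   values.data() + k * block_items);
}
```

With `block_size == 1` this reduces to the scalar `std::swap` the function used before the patch, which is why the new parameter can default to 1 without changing the Crs and Ccs callers.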