Building for many CUDA archs leads to linker errors #1734

Open
lahwaacz opened this issue Nov 27, 2024 · 8 comments

@lahwaacz
Contributor

While building a package for Arch Linux, I found that enabling all CUDA architectures (-DGINKGO_CUDA_ARCHITECTURES="All") leads to this error on the final link:

FAILED: lib/libginkgo_cuda.so.1.9.0
: && /usr/bin/c++ -fPIC -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=3 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection         -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -Wp,-D_GLIBCXX_ASSERTIONS -g -ffile-prefix-map=/build/ginkgo-hpc-git/src=/usr/src/debug/ginkgo-hpc-git -flto=auto  -Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now          -Wl,-z,pack-relative-relocs -flto=auto   -Wl,--dependency-file,cuda/CMakeFiles/ginkgo_cuda.dir/link.d -shared -Wl,-soname,libginkgo_cuda.so.1.9.0 -o lib/libginkgo_cuda.so.1.9.0 devices/cuda/CMakeFiles/ginkgo_cuda_device.dir/executor.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/device.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/exception.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/executor.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/memory.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/nvtx.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/scoped_device_id.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/stream.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/timer.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/base/version.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.0.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.1.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.2.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.3.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.4.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.5.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.6.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.7.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.8.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.9.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.10.cpp.o 
cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.11.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.12.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.13.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.14.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.15.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.16.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.17.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.18.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.19.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/csr_kernels.instantiate.20.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/fbcsr_kernels.instantiate.0.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/fbcsr_kernels.instantiate.1.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/fbcsr_kernels.instantiate.2.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/matrix/fft_kernels.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/preconditioner/batch_jacobi_kernels.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_kernels.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.0.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.1.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.2.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.3.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.4.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.5.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.6.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.7.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.8.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.9.cpp.o 
cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.10.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.0.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_bicgstab_launch.instantiate.1.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_kernels.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.0.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.1.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.2.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.3.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.4.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.5.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.6.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.0.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/batch_cg_launch.instantiate.1.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/lower_trs_kernels.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/solver/upper_trs_kernels.cu.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/base/device_matrix_data_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/base/index_set_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/components/absolute_array_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/components/fill_array_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/components/format_conversion_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/components/precision_conversion_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/components/reduce_array_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/distributed/partition_helpers_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/distributed/partition_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/coo_kernels.cpp.o 
cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/csr_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/ell_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/hybrid_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/permutation_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/scaled_permutation_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/sellp_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/sparsity_csr_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/diagonal_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/multigrid/pgm_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/preconditioner/jacobi_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/bicg_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/bicgstab_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/cg_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/cgs_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/common_gmres_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/fcg_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/gcr_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/gmres_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/solver/ir_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/unified/matrix/dense_kernels.instantiate.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/base/batch_multi_vector_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/base/device_matrix_data_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/base/index_set_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/components/prefix_sum_kernels.cpp.o 
cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/distributed/index_map_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/distributed/matrix_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/distributed/partition_helpers_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/distributed/partition_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/distributed/vector_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/cholesky_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/factorization_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/ic_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/ilu_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/lu_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ic_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ict_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ilu_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ilut_approx_filter_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ilut_filter_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ilut_select_common.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ilut_select_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ilut_spgeam_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/factorization/par_ilut_sweep_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/batch_csr_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/batch_dense_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/batch_ell_kernels.cpp.o 
cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/coo_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/dense_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/diagonal_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/ell_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/sellp_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/matrix/sparsity_csr_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/multigrid/pgm_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/isai_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_generate_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_simple_apply_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/sor_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/reorder/rcm_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/solver/cb_gmres_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/solver/idr_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/solver/multigrid_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/stop/criterion_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/stop/residual_norm_kernels.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_generate_kernels.instantiate.1.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_simple_apply_kernels.instantiate.1.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernels.instantiate.1.cpp.o 
cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_generate_kernels.instantiate.2.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_simple_apply_kernels.instantiate.2.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernels.instantiate.2.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_generate_kernels.instantiate.4.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_simple_apply_kernels.instantiate.4.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernels.instantiate.4.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_generate_kernels.instantiate.8.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_simple_apply_kernels.instantiate.8.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernels.instantiate.8.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_generate_kernels.instantiate.13.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_simple_apply_kernels.instantiate.13.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernels.instantiate.13.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_generate_kernels.instantiate.16.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_simple_apply_kernels.instantiate.16.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernels.instantiate.16.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_generate_kernels.instantiate.32.cpp.o cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_simple_apply_kernels.instantiate.32.cpp.o 
cuda/CMakeFiles/ginkgo_cuda.dir/__/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernels.instantiate.32.cpp.o -L/opt/cuda/targets/x86_64-linux/lib/stubs   -L/opt/cuda/targets/x86_64-linux/lib   -L/usr/lib/gcc/x86_64-pc-linux-gnu/13.3.1 -Wl,-rpath,/opt/cuda/targets/x86_64-linux/lib:/build/ginkgo-hpc-git/src/build-cuda/lib:  /opt/cuda/targets/x86_64-linux/lib/libcudart.so  /opt/cuda/targets/x86_64-linux/lib/libcublas.so  /opt/cuda/targets/x86_64-linux/lib/libcusparse.so  /opt/cuda/targets/x86_64-linux/lib/libcurand.so  /opt/cuda/targets/x86_64-linux/lib/libcufft.so  lib/libginkgo_device.so.1.9.0  -ldl  -ldl  /usr/lib/librt.a  /opt/cuda/targets/x86_64-linux/lib/libcublasLt.so  /opt/cuda/targets/x86_64-linux/lib/libculibos.a  /opt/cuda/targets/x86_64-linux/lib/libnvJitLink.so  -lcudadevrt  -lcudart_static  -lrt  -lpthread  -ldl && :
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../lib/crti.o: in function `_init':
(.init+0xb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
/tmp/cc6UKHTo.ltrans0.ltrans.o: in function `std::_Function_handler<void (cublasContext*), gko::CudaExecutor::init_handles()::{lambda(cublasContext*)#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation)':
/usr/include/c++/14.2.1/bits/std_function.h:274:(.text+0xfb): relocation truncated to fit: R_X86_64_PC32 against `.data.rel.ro'
/tmp/cc6UKHTo.ltrans0.ltrans.o: in function `std::_Function_handler<void (cusparseContext*), gko::CudaExecutor::init_handles()::{lambda(cusparseContext*)#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation)':
/usr/include/c++/14.2.1/bits/std_function.h:274:(.text+0x13b): relocation truncated to fit: R_X86_64_PC32 against `.data.rel.ro'
/tmp/cc6UKHTo.ltrans0.ltrans.o: in function `nvtxEtiGetModuleFunctionTable_v3':
/opt/cuda/targets/x86_64-linux/include/nvtx3/nvtxDetail/nvtxImpl.h:401:(.text+0x243): relocation truncated to fit: R_X86_64_PC32 against symbol `nvtxGlobals_v3' defined in .data.rel.local section in /tmp/cc6UKHTo.ltrans0.ltrans.o
/opt/cuda/targets/x86_64-linux/include/nvtx3/nvtxDetail/nvtxImpl.h:404:(.text+0x273): relocation truncated to fit: R_X86_64_PC32 against symbol `nvtxGlobals_v3' defined in .data.rel.local section in /tmp/cc6UKHTo.ltrans0.ltrans.o
/opt/cuda/targets/x86_64-linux/include/nvtx3/nvtxDetail/nvtxImpl.h:424:(.text+0x283): relocation truncated to fit: R_X86_64_PC32 against symbol `nvtxGlobals_v3' defined in .data.rel.local section in /tmp/cc6UKHTo.ltrans0.ltrans.o
/opt/cuda/targets/x86_64-linux/include/nvtx3/nvtxDetail/nvtxImpl.h:412:(.text+0x293): relocation truncated to fit: R_X86_64_PC32 against symbol `nvtxGlobals_v3' defined in .data.rel.local section in /tmp/cc6UKHTo.ltrans0.ltrans.o
/opt/cuda/targets/x86_64-linux/include/nvtx3/nvtxDetail/nvtxImpl.h:416:(.text+0x2a3): relocation truncated to fit: R_X86_64_PC32 against symbol `nvtxGlobals_v3' defined in .data.rel.local section in /tmp/cc6UKHTo.ltrans0.ltrans.o
/opt/cuda/targets/x86_64-linux/include/nvtx3/nvtxDetail/nvtxImpl.h:420:(.text+0x2b3): relocation truncated to fit: R_X86_64_PC32 against symbol `nvtxGlobals_v3' defined in .data.rel.local section in /tmp/cc6UKHTo.ltrans0.ltrans.o
/tmp/cc6UKHTo.ltrans0.ltrans.o: in function `nvtxGetExportTable_v3':
/opt/cuda/targets/x86_64-linux/include/nvtx3/nvtxDetail/nvtxImpl.h:443:(.text+0x2d7): relocation truncated to fit: R_X86_64_PC32 against symbol `nvtxGlobals_v3' defined in .data.rel.local section in /tmp/cc6UKHTo.ltrans0.ltrans.o
/tmp/cc6UKHTo.ltrans0.ltrans.o: in function `gko::CudaExecutor::get_master()':
/usr/include/c++/14.2.1/ext/atomicity.h:52:(.text+0x332): additional relocation overflows omitted from the output
lib/libginkgo_cuda.so.1.9.0: PC-relative offset overflow in PLT entry for `_ZN3gko7kernels4cuda10run_kernelI17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSt10shared_ptrIKNS_12CudaExecutorEEPKlPKNS_6matrix5DenseIfEEPSD_EXadL_ZNS1_5dense12symm_permuteIflEEvS8_PKT0_PKNSC_IT_EEPSP_EELj1EEJEEJRSF_RSA_RSG_EEEvS8_SO_NS_3dimILm2EmEEDpOT0_'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

For 1.8.0 we worked around it by omitting a few architectures:

# In general, we want to list all real archs (sm_XX) and the latest virtual arch (compute_XX) for future PTX compatibility.
# Valid values can be discovered from nvcc --help
# Compiling Ginkgo for all real architectures triggers linker limits (2 GB binary size). So let's omit 52, 53, 62, 72 from the list.
local _cuda_archs="50;60;61;70;75;80;86;87;89;90;90a;90a-virtual"

cmake -DCMAKE_CUDA_ARCHITECTURES="$_cuda_archs" ...

But building the develop branch now fails again, even with the same trick... Any ideas? Maybe split libginkgo_cuda.so into several smaller libraries?
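
For context on the 2 GB limit mentioned above: the R_X86_64_PC32 relocations that dominate the log store a signed 32-bit displacement, so code can only reference targets within roughly ±2 GiB of itself. The sketch below only illustrates that range check; `fits_pc32` is a hypothetical helper name, not something provided by the linker or by Ginkgo.

```shell
#!/bin/sh
# fits_pc32 DISPLACEMENT
# Succeeds if the displacement (in bytes) fits into the signed 32-bit
# field used by R_X86_64_PC32, i.e. lies in [-2^31, 2^31 - 1].
# Hypothetical helper for illustration only.
fits_pc32() {
    [ "$1" -ge -2147483648 ] && [ "$1" -le 2147483647 ]
}

# Try a small offset, the largest representable one, and one byte beyond it.
for disp in 1048576 2147483647 2147483648; do
    if fits_pc32 "$disp"; then
        echo "$disp: fits"
    else
        echo "$disp: relocation truncated to fit"
    fi
done
```

Once the sections of a single shared object span more than this range, references between them can overflow, which would explain why the failing symbol differs from build to build.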

@yhmtsai
Member

yhmtsai commented Nov 27, 2024

If you reduce the list of architectures further, does it still happen?

@upsj
Member

upsj commented Nov 27, 2024

If you remove all references to the NVTX library (including the header #include) from cuda/base/nvtx.cpp by emptying all functions, does the issue still appear?

@lahwaacz
Contributor Author

If you reduce the list of architectures further, does it still happen?

Building for just one architecture works, but that does not help. The intention is to build a general binary package that can be used efficiently on any GPU architecture. Also, I've found a reduced set of archs that works for Ginkgo 1.8.0 but will lead to the same error on the next release (currently develop branch), and it is not practical to reduce architectures again and again for new releases.

If you remove all references to the NVTX library (including the header #include) from cuda/base/nvtx.cpp by emptying all functions, does the issue still appear?

It is not a problem with one specific library. Just tried to build it on a different system (without any code changes) and a different name appears in the output:

/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../lib/crti.o: in function `_init':
(.init+0xb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x3): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0xa): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .nvFatBinSegment section in lib/libginkgo_cuda.so.1.9.0
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x16): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `_ITM_deregisterTMCloneTable'
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x33): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x3a): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .nvFatBinSegment section in lib/libginkgo_cuda.so.1.9.0
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x57): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `_ITM_registerTMCloneTable'
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x76): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x81): relocation truncated to fit: R_X86_64_GOTPCREL against symbol `__cxa_finalize@@GLIBC_2.2.5' defined in .text section in /usr/lib/libc.so.6
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x8e): relocation truncated to fit: R_X86_64_PC32 against symbol `__dso_handle' defined in .data.rel.local section in /usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o
/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/crtbeginS.o:(.text+0x94): additional relocation overflows omitted from the output
lib/libginkgo_cuda.so.1.9.0: PC-relative offset overflow in PLT entry for `_ZN3gko7kernels4cuda10run_kernelI17__nv_dl_wrapper_tI11__nv_dl_tagIPFvSt10shared_ptrIKNS_12CudaExecutorEEPKlPKNS_6matrix5DenseIfEEPSD_EXadL_ZNS1_5dense12symm_permuteIflEEvS8_PKT0_PKNSC_IT_EEPSP_EELj1EEJEEJRSF_RSA_RSG_EEEvS8_SO_NS_3dimILm2EmEEDpOT0_'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

@yhmtsai
Member

yhmtsai commented Dec 4, 2024

Hi @lahwaacz, I can reproduce it on my side. Unfortunately, neither -mcmodel=medium nor -mcmodel=large fixes it. I also checked some other libraries that hit this issue: they also suggest reducing the arch list, and some of them split their library later on. Someone has already complained that we have many libraries.
The attached patch splits the library; you can apply it to the new release when building the package.
Unfortunately, we will not reach a decision for this release, because the release is already in progress and the change needs feedback from the others.

split_cuda_library.patch

@lahwaacz
Contributor Author

FYI, I just had to drop 5 more architectures in order to be able to build Ginkgo 1.9.0 with CUDA on Arch Linux. I can build just for "50;60;70;80;90;90a;90a-virtual", and adding anything more triggers this issue. For comparison, we could build Ginkgo 1.8.0 for "50;60;61;70;75;80;86;87;89;90;90a;90a-virtual".

I have not tried the patch for splitting yet...
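
Because the list has to be trimmed again for each release, it may help to generate it in the packaging script instead of editing the string by hand. A minimal sketch, assuming the same convention as the lists above (every architecture as given, plus a `-virtual` entry for the last one); `make_cuda_archs` is a hypothetical helper, not part of Ginkgo or CMake:

```shell
#!/bin/sh
# make_cuda_archs ARCH...
# Prints a semicolon-separated architecture list: every given arch as-is,
# followed by a "-virtual" entry for the last one (PTX for future GPUs).
# Hypothetical helper for illustration only.
make_cuda_archs() {
    list=""
    last=""
    for arch in "$@"; do
        list="${list}${arch};"
        last="$arch"
    done
    # No architectures given: print nothing.
    [ -n "$last" ] || return 0
    printf '%s%s-virtual\n' "$list" "$last"
}

# Reproduces the reduced list that builds for Ginkgo 1.9.0.
_cuda_archs=$(make_cuda_archs 50 60 70 80 90 90a)
echo "$_cuda_archs"
```

The result can be passed as -DCMAKE_CUDA_ARCHITECTURES="$_cuda_archs" as before. According to the CMake documentation for CUDA_ARCHITECTURES, an entry without a suffix generates both real and virtual code, while a `-real` suffix restricts it to device code only; whether that saves enough space here was not tested in this thread.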

@yhmtsai
Member

yhmtsai commented Dec 12, 2024

Hi @lahwaacz
Thanks for trying that.
CUDA does not split the binary by architecture internally, so unfortunately we can only split it at the library level.
We added more functionality and half-precision support for most routines, which means roughly 1.5x as many functions.
You can disable it with -DGINKGO_ENABLE_HALF=OFF, so you might not need to drop as many archs.
I have used the split patch to compile Ginkgo with the same arch list after the half-precision PRs were merged.

@lahwaacz
Contributor Author

With -DGINKGO_ENABLE_HALF=OFF, I still had to drop 3 more architectures compared to 1.8.0 so now I'm trying the patch to split the cuda library...

Note that when building for "50;52;53;60;61;62;70;72;75;80;86;87;89;90;90a;90a-virtual", even this fails on linking lib/libginkgo_cuda_4.so.1.9.0 which is still too big. I had to split it further into two libraries:

target_sources(ginkgo_cuda_4
    PRIVATE
    ${BATCH_BICGSTAB_INSTANTIATE1}
    ${BATCH_BICGSTAB_INSTANTIATE2}
)
target_sources(ginkgo_cuda_5
    PRIVATE
    ${BATCH_CG_INSTANTIATE1}
    ${BATCH_CG_INSTANTIATE2}
)

So in the end, ginkgo-hpc-cuda 1.9.0-1 in Arch Linux has the following library sizes:

-rwxr-xr-x 1 root root 184M Dec 14 07:22 /usr/lib/libginkgo_cuda_2.so.1.9.0*
-rwxr-xr-x 1 root root 767M Dec 14 07:22 /usr/lib/libginkgo_cuda_3.so.1.9.0*
-rwxr-xr-x 1 root root 1.6G Dec 14 07:22 /usr/lib/libginkgo_cuda_4.so.1.9.0*
-rwxr-xr-x 1 root root 834M Dec 14 07:22 /usr/lib/libginkgo_cuda_5.so.1.9.0*
-rwxr-xr-x 1 root root 1.3G Dec 14 07:22 /usr/lib/libginkgo_cuda.so.1.9.0*
-rwxr-xr-x 1 root root  14K Dec 14 07:22 /usr/lib/libginkgo_device.so.1.9.0*
-rwxr-xr-x 1 root root 863K Dec 14 07:22 /usr/lib/libginkgo_dpcpp.so.1.9.0*
-rwxr-xr-x 1 root root 871K Dec 14 07:22 /usr/lib/libginkgo_hip.so.1.9.0*
-rwxr-xr-x 1 root root  20M Dec 14 07:22 /usr/lib/libginkgo_omp.so.1.9.0*
-rwxr-xr-x 1 root root 4.9M Dec 14 07:22 /usr/lib/libginkgo_reference.so.1.9.0*
-rwxr-xr-x 1 root root  53M Dec 14 07:22 /usr/lib/libginkgo.so.1.9.0*
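
Since every workaround in this thread was found by trial and error, a packaging script could flag libraries that approach the limit right after the build. A rough sketch, assuming file size is a usable proxy (the actual overflow depends on section layout, and the 1.6G libginkgo_cuda_4.so above still linked fine); `check_lib_size` and the threshold are hypothetical:

```shell
#!/bin/sh
# check_lib_size FILE LIMIT_BYTES
# Returns 0 if FILE is at most LIMIT_BYTES large, 1 if it is larger,
# and 2 if the file cannot be read. Hypothetical helper for illustration.
check_lib_size() {
    size=$(wc -c < "$1") || return 2
    # Normalize the number (some wc implementations pad it with spaces).
    size=$((size + 0))
    if [ "$size" -gt "$2" ]; then
        echo "WARNING: $1 is $size bytes (limit $2)"
        return 1
    fi
    echo "ok: $1 is $size bytes"
}

# Check all CUDA libraries in the build tree, if any exist.
for lib in lib/libginkgo_cuda*.so.*; do
    [ -e "$lib" ] || continue
    check_lib_size "$lib" 2147483647
done
```

A lower threshold would give earlier warning before a new release pushes one of the split libraries over the edge.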

@yhmtsai
Member

yhmtsai commented Dec 16, 2024

Thanks for testing!
