AMGCL on AMD GPU #1

Open
pelyakim opened this issue Mar 22, 2023 · 23 comments

Comments

@pelyakim

Hello,
Have you already used the AMGCL library on AMD graphics cards? I would use it to solve a pseudo-Poisson equation in a finite-volume fluid mechanics code on curvilinear structured meshes. The code runs in parallel on CPUs (MPI), and the Poisson solver would run on GPUs. Thank you very much for your answer, Pierre

@ddemidov
Owner

Hello Pierre,

Yes, you can use amgcl on AMD cards through OpenCL via the vexcl backend. There is an example of using MPI with vexcl here: https://amgcl.readthedocs.io/en/latest/tutorial/SerenaMPI.html#id2

@pelyakim
Author

Thanks for your answer.
My partitioning is already done, and I stored it in a vector the size of my domain (nx*ny*nz in 3D). Is it possible to use it in this state? Moreover, I would like to use, for example, PCG as the preconditioner and AMG as the solver; is that possible? Thanks for your answers. Pierre

@ddemidov
Owner

It should be possible. See more details on partitioning and MPI here: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3DbMPI.html. There is an example of using PCG here: https://amgcl.readthedocs.io/en/latest/tutorial/NullspaceMPI.html
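
For readers with the same setup: the MPI tutorial linked above describes each process as owning one contiguous chunk of matrix rows, so a per-cell partition vector would first need to be turned into a renumbering. The following is a minimal sketch of that step, assuming the stored vector holds the owning rank of every cell; `RowLayout` and `contiguous_layout` are illustrative names, not part of AMGCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row layout in which every rank owns one contiguous block of rows.
struct RowLayout {
    std::vector<std::ptrdiff_t> new_index; // new_index[cell] = renumbered row of that cell
    std::vector<std::ptrdiff_t> row_begin; // rank r owns rows [row_begin[r], row_begin[r+1])
};

// part[cell] is assumed to hold the owning rank of the cell (0 <= rank < nranks).
// Cells keep their relative order inside each rank (a stable counting sort).
RowLayout contiguous_layout(const std::vector<int> &part, int nranks) {
    RowLayout out;
    out.row_begin.assign(nranks + 1, 0);

    // Count the cells owned by each rank.
    for (int p : part) {
        assert(p >= 0 && p < nranks);
        ++out.row_begin[p + 1];
    }

    // A prefix sum turns the counts into block offsets.
    for (int r = 0; r < nranks; ++r)
        out.row_begin[r + 1] += out.row_begin[r];

    // Hand out consecutive row numbers inside each rank's block.
    std::vector<std::ptrdiff_t> next(out.row_begin.begin(), out.row_begin.end() - 1);
    out.new_index.resize(part.size());
    for (std::size_t i = 0; i < part.size(); ++i)
        out.new_index[i] = next[part[i]]++;

    return out;
}
```

With `part = {1, 0, 1, 0, 2}` and three ranks, the blocks start at rows 0, 2 and 4, and the cells are renumbered to 2, 0, 3, 1, 4. The matrix rows and columns would then be permuted with `new_index` before each rank passes its block of rows to the solver.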

@pelyakim
Author

Thank you. Can you explain to me what the make_shared function does and where its sources are? Thanks

@ddemidov
Owner

@pelyakim
Author

Sorry, I meant to ask about the distributed_matrix function. Thanks

@pelyakim
Author

Hello, I could not compile AMGCL with VexCL; I have the impression that it does not find VexCL. Also, I don't know how to install VexCL. Could you help me with the installation? After that, I would like to run the tutorials, especially the Poisson problem in parallel with VexCL on AMD graphics cards. For that, I can't find the matrix poisson3Db.bin; could you tell me where I can find it?
Thanks for all these pointers. Sincerely, Pierre

@ddemidov
Owner

Try the following in a separate folder:

git clone https://github.com/ddemidov/vexcl
cmake -Bvexcl_build vexcl

After this, try to reconfigure amgcl. It should find vexcl now.

@pelyakim
Author

Thanks, it did find vexcl when running cmake. Also, I can't find the poisson3Db.bin file to test the poisson3Db_mpi_vexcl_cl executable from the MPI version of the Poisson problem tutorial; could you tell me where I can find it?
Thanks, Pierre

@ddemidov
Owner

You can convert the mtx file to a bin file using the examples/mm2bin utility; search for 'mm2bin' on this page: https://amgcl.readthedocs.io/en/latest/tutorial/Serena.html?highlight=mm2bin#structural-problem.

There is a link to download the Poisson3Db matrix here: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3Db.html

@pelyakim
Author

Hi, thanks, but I have a problem with the AMGCL build: I can't link the scotch installation to AMGCL. When I run the command that generates the Makefile, cmake -DCMAKE_INSTALL_PREFIX=/lus/home/pelyakime/AMGCL/scotch-v7.0.1/install -DCMAKE_BUILD_TYPE=Release .. , I have the impression that it does not find the scotch library.
What can I do? Thanks for your help

@ddemidov
Owner

ddemidov commented Apr 4, 2023

ddemidov/amgcl#255

@pelyakim
Author

pelyakim commented Apr 6, 2023

Thank you very much for your help. I no longer have any problems with the libraries, and I can test the executables of the poisson3Db tutorial: poisson3Db and poisson3Db_mpi work without problems, but poisson3Db_mpi_vexcl fails with a segmentation fault. I am on an AMD architecture with AMD graphics cards, and I hope to use OpenCL to compute on the AMD GPUs (I run on the Adastra machine in France).
I managed to recompile the sources (poisson3Db_mpi_vexcl.cpp), and I have the impression that the problem appears as soon as the code reaches this place:
```
for (int i = 0; i < world.size; ++i) {
    // unclutter the output:
    if (i == world.rank)
        std::cout << world.rank << ":" << ctx.queue(0) << std::endl;
    MPI_Barrier(world);
}
```

I'm not sure where the problem could be coming from, would you have any idea? 
Thanks a lot for your help. Pierre

@ddemidov
Owner

ddemidov commented Apr 6, 2023

Looks like you don't have any GPUs in the context. What does vexcl/examples/devlist output on your system?
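
One way to make such a failure easier to diagnose is to check the context before the first `ctx.queue(0)` call. The sketch below is written against any type with an `empty()` member; it assumes `vex::Context` provides one, and `require_devices` and `FakeContext` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Throws if the compute context holds no devices, so the failure is reported
// with a readable message instead of crashing on the first queue access.
// Works with any type exposing empty(); vex::Context is assumed to be one.
template <class Context>
void require_devices(const Context &ctx, int rank) {
    assert(rank >= 0);
    if (ctx.empty())
        throw std::runtime_error(
            "rank " + std::to_string(rank) + ": no compute devices in the context");
}

// Hypothetical stand-in used only to exercise the guard without a GPU.
struct FakeContext {
    int ndev;
    bool empty() const { return ndev == 0; }
};
```

In the tutorial program this would be called as `require_devices(ctx, world.rank)` right after the context is created.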

@pelyakim
Author

pelyakim commented Apr 7, 2023

Hello,
when I put this command in my Slurm script, I get:

Currently Loaded Modules:
  1) craype-network-ofi        9) cray-mpich/8.1.24
  2) craype-x86-trento        10) craype/2.7.19
  3) craype-accel-amd-gfx90a  11) perftools-base/23.02.0
  4) libfabric/1.15.2.0       12) rocm/5.2.0
  5) PrgEnv-cray/8.3.3        13) cpe/23.02
  6) cce/15.0.1               14) CPE-23.02-cce-15.0.1-GPU-softs
  7) cray-dsmml/0.2.2         15) scotch/6.1.3-mpi
  8) cray-libsci/23.02.1.1    16) boost/1.81.0-mpi-python3

 

	linux-vdso.so.1 (0x00007ffe9ab92000)
	libOpenCL.so.1 => /opt/rocm-5.2.0/lib/libOpenCL.so.1 (0x000015321c880000)
	libboost_filesystem-mt-x64.so.1.81.0 => /opt/software/gaia/dev/1.0.3-e7de077a/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/__spack_p/boost-1.81.0-cce-15.0.1-cn4o/lib/libboost_filesystem-mt-x64.so.1.81.0 (0x000015321cc86000)
	libamdhip64.so.5 => /opt/rocm-5.2.0/hip/lib/libamdhip64.so.5 (0x000015321b98b000)
	libmpi_cray.so.12 => /opt/cray/pe/lib64/libmpi_cray.so.12 (0x0000153218ffd000)
	libmpi_gtl_hsa.so.0 => /opt/cray/pe/lib64/libmpi_gtl_hsa.so.0 (0x0000153218d9a000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000153218b96000)
	libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x0000153218770000)
	libfi.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libfi.so.1 (0x00001532181cb000)
	libquadmath.so.0 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libquadmath.so.0 (0x0000153217f84000)
	libmodules.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libmodules.so.1 (0x000015321cc5d000)
	libcraymath.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libcraymath.so.1 (0x000015321cb74000)
	libf.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libf.so.1 (0x000015321cae0000)
	libu.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libu.so.1 (0x0000153217e7b000)
	libcsup.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libcsup.so.1 (0x000015321cad7000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000153217c5b000)
	libm.so.6 => /lib64/libm.so.6 (0x00001532178d9000)
	libunwind.so.1 => /opt/cray/pe/cce/15.0.1/cce-clang/x86_64/lib/libunwind.so.1 (0x000015321cac0000)
	libc.so.6 => /lib64/libc.so.6 (0x0000153217514000)
	librt.so.1 => /lib64/librt.so.1 (0x000015321730c000)
	libamd_comgr.so.2 => /opt/rocm-5.2.0/lib/libamd_comgr.so.2 (0x000015320fc5c000)
	libhsa-runtime64.so.1 => /opt/rocm-5.2.0/lib/libhsa-runtime64.so.1 (0x000015320f80f000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x000015320f603000)
	libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x000015320f3e4000)
	/lib64/ld-linux-x86-64.so.2 (0x000015321ca88000)
	libfabric.so.1 => /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1 (0x000015320f0f1000)
	libpmi.so.0 => /opt/cray/pe/lib64/libpmi.so.0 (0x000015320eeef000)
	libpmi2.so.0 => /opt/cray/pe/lib64/libpmi2.so.0 (0x000015320ecce000)
	libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x000015320e801000)
	libz.so.1 => /lib64/libz.so.1 (0x000015320e5e9000)
	libtinfo.so.6 => /lib64/libtinfo.so.6 (0x000015320e3bc000)
	libelf.so.1 => /lib64/libelf.so.1 (0x000015320e1a3000)
	libdrm.so.2 => /opt/amdgpu/lib64/libdrm.so.2 (0x000015320df8f000)
	libdrm_amdgpu.so.1 => /opt/amdgpu/lib64/libdrm_amdgpu.so.1 (0x000015320dd83000)
	libcxi.so.1 => /lib64/libcxi.so.1 (0x000015320db5e000)
	libcurl.so.4 => /lib64/libcurl.so.4 (0x000015320d8d0000)
	libjson-c.so.4 => /lib64/libjson-c.so.4 (0x000015320d6c0000)
	libatomic.so.1 => /opt/cray/pe/gcc-libs/libatomic.so.1 (0x000015320d4b7000)
	libpals.so.0 => /opt/cray/pe/lib64/libpals.so.0 (0x000015320d2af000)
	libnghttp2.so.14 => /lib64/libnghttp2.so.14 (0x000015320d088000)
	libidn2.so.0 => /lib64/libidn2.so.0 (0x000015320ce6a000)
	libssh.so.4 => /lib64/libssh.so.4 (0x000015320cbfb000)
	libpsl.so.5 => /lib64/libpsl.so.5 (0x000015320c9ea000)
	libssl.so.1.1 => /lib64/libssl.so.1.1 (0x000015320c754000)
	libcrypto.so.1.1 => /lib64/libcrypto.so.1.1 (0x000015320c26b000)
	libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x000015320c016000)
	libkrb5.so.3 => /lib64/libkrb5.so.3 (0x000015320bd2c000)
	libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x000015320bb15000)
	libcom_err.so.2 => /lib64/libcom_err.so.2 (0x000015320b911000)
	libldap-2.4.so.2 => /lib64/libldap-2.4.so.2 (0x000015320b6c0000)
	liblber-2.4.so.2 => /lib64/liblber-2.4.so.2 (0x000015320b4b0000)
	libbrotlidec.so.1 => /lib64/libbrotlidec.so.1 (0x000015320b2a3000)
	libjansson.so.4 => /lib64/libjansson.so.4 (0x000015320b095000)
	libunistring.so.2 => /lib64/libunistring.so.2 (0x000015320ad14000)
	libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x000015320ab01000)
	libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x000015320a8fd000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x000015320a6e6000)
	libsasl2.so.3 => /lib64/libsasl2.so.3 (0x000015320a4c8000)
	libbrotlicommon.so.1 => /lib64/libbrotlicommon.so.1 (0x000015320a2a7000)
	libselinux.so.1 => /lib64/libselinux.so.1 (0x000015320a07b000)
	libcrypt.so.1 => /lib64/libcrypt.so.1 (0x0000153209e52000)
	libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x0000153209bce000)
OpenCL devices:

  gfx90a:sramecc+:xnack-
    CL_PLATFORM_NAME                 = AMD Accelerated Parallel Processing
    CL_DEVICE_TYPE                   = 4
    CL_DEVICE_VENDOR                 = Advanced Micro Devices, Inc.
    CL_DEVICE_VERSION                = OpenCL 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS      = 110
    CL_DEVICE_HOST_UNIFIED_MEMORY    = 0
    CL_DEVICE_GLOBAL_MEM_SIZE        = 68702699520
    CL_DEVICE_LOCAL_MEM_SIZE         = 65536
    CL_DEVICE_MAX_MEM_ALLOC_SIZE     = 58397294592
    CL_DEVICE_ADDRESS_BITS           = 64
    CL_DEVICE_MAX_CLOCK_FREQUENCY    = 1700
    CL_DEVICE_EXTENSIONS             = cl_amd_assembly_program 
        cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops 
        cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store 
        cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing 
        cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
        cl_khr_image2d_from_buffer cl_khr_int64_base_atomics 
        cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics 
        cl_khr_local_int32_extended_atomics cl_khr_subgroups 

  gfx90a:sramecc+:xnack-
    CL_PLATFORM_NAME                 = AMD Accelerated Parallel Processing
    CL_DEVICE_TYPE                   = 4
    CL_DEVICE_VENDOR                 = Advanced Micro Devices, Inc.
    CL_DEVICE_VERSION                = OpenCL 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS      = 110
    CL_DEVICE_HOST_UNIFIED_MEMORY    = 0
    CL_DEVICE_GLOBAL_MEM_SIZE        = 68702699520
    CL_DEVICE_LOCAL_MEM_SIZE         = 65536
    CL_DEVICE_MAX_MEM_ALLOC_SIZE     = 58397294592
    CL_DEVICE_ADDRESS_BITS           = 64
    CL_DEVICE_MAX_CLOCK_FREQUENCY    = 1700
    CL_DEVICE_EXTENSIONS             = cl_amd_assembly_program 
        cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops 
        cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store 
        cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing 
        cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
        cl_khr_image2d_from_buffer cl_khr_int64_base_atomics 
        cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics 
        cl_khr_local_int32_extended_atomics cl_khr_subgroups 

  gfx90a:sramecc+:xnack-
    CL_PLATFORM_NAME                 = AMD Accelerated Parallel Processing
    CL_DEVICE_TYPE                   = 4
    CL_DEVICE_VENDOR                 = Advanced Micro Devices, Inc.
    CL_DEVICE_VERSION                = OpenCL 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS      = 110
    CL_DEVICE_HOST_UNIFIED_MEMORY    = 0
    CL_DEVICE_GLOBAL_MEM_SIZE        = 68702699520
    CL_DEVICE_LOCAL_MEM_SIZE         = 65536
    CL_DEVICE_MAX_MEM_ALLOC_SIZE     = 58397294592
    CL_DEVICE_ADDRESS_BITS           = 64
    CL_DEVICE_MAX_CLOCK_FREQUENCY    = 1700
    CL_DEVICE_EXTENSIONS             = cl_amd_assembly_program 
        cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops 
        cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store 
        cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing 
        cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
        cl_khr_image2d_from_buffer cl_khr_int64_base_atomics 
        cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics 
        cl_khr_local_int32_extended_atomics cl_khr_subgroups 

  gfx90a:sramecc+:xnack-
    CL_PLATFORM_NAME                 = AMD Accelerated Parallel Processing
    CL_DEVICE_TYPE                   = 4
    CL_DEVICE_VENDOR                 = Advanced Micro Devices, Inc.
    CL_DEVICE_VERSION                = OpenCL 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS      = 110
    CL_DEVICE_HOST_UNIFIED_MEMORY    = 0
    CL_DEVICE_GLOBAL_MEM_SIZE        = 68702699520
    CL_DEVICE_LOCAL_MEM_SIZE         = 65536
    CL_DEVICE_MAX_MEM_ALLOC_SIZE     = 58397294592
    CL_DEVICE_ADDRESS_BITS           = 64
    CL_DEVICE_MAX_CLOCK_FREQUENCY    = 1700
    CL_DEVICE_EXTENSIONS             = cl_amd_assembly_program 
        cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops 
        cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store 
        cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing 
        cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
        cl_khr_image2d_from_buffer cl_khr_int64_base_atomics 
        cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics 
        cl_khr_local_int32_extended_atomics cl_khr_subgroups 

  gfx90a:sramecc+:xnack-
    CL_PLATFORM_NAME                 = AMD Accelerated Parallel Processing
    CL_DEVICE_TYPE                   = 4
    CL_DEVICE_VENDOR                 = Advanced Micro Devices, Inc.
    CL_DEVICE_VERSION                = OpenCL 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS      = 110
    CL_DEVICE_HOST_UNIFIED_MEMORY    = 0
    CL_DEVICE_GLOBAL_MEM_SIZE        = 68702699520
    CL_DEVICE_LOCAL_MEM_SIZE         = 65536
    CL_DEVICE_MAX_MEM_ALLOC_SIZE     = 58397294592
    CL_DEVICE_ADDRESS_BITS           = 64
    CL_DEVICE_MAX_CLOCK_FREQUENCY    = 1700
    CL_DEVICE_EXTENSIONS             = cl_amd_assembly_program 
        cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops 
        cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store 
        cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing 
        cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
        cl_khr_image2d_from_buffer cl_khr_int64_base_atomics 
        cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics 
        cl_khr_local_int32_extended_atomics cl_khr_subgroups 

  gfx90a:sramecc+:xnack-
    CL_PLATFORM_NAME                 = AMD Accelerated Parallel Processing
    CL_DEVICE_TYPE                   = 4
    CL_DEVICE_VENDOR                 = Advanced Micro Devices, Inc.
    CL_DEVICE_VERSION                = OpenCL 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS      = 110
    CL_DEVICE_HOST_UNIFIED_MEMORY    = 0
    CL_DEVICE_GLOBAL_MEM_SIZE        = 68702699520
    CL_DEVICE_LOCAL_MEM_SIZE         = 65536
    CL_DEVICE_MAX_MEM_ALLOC_SIZE     = 58397294592
    CL_DEVICE_ADDRESS_BITS           = 64
    CL_DEVICE_MAX_CLOCK_FREQUENCY    = 1700
    CL_DEVICE_EXTENSIONS             = cl_amd_assembly_program 
        cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops 
        cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store 
        cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing 
        cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
        cl_khr_image2d_from_buffer cl_khr_int64_base_atomics 
        cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics 
        cl_khr_local_int32_extended_atomics cl_khr_subgroups 

  gfx90a:sramecc+:xnack-
    CL_PLATFORM_NAME                 = AMD Accelerated Parallel Processing
    CL_DEVICE_TYPE                   = 4
    CL_DEVICE_VENDOR                 = Advanced Micro Devices, Inc.
    CL_DEVICE_VERSION                = OpenCL 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS      = 110
    CL_DEVICE_HOST_UNIFIED_MEMORY    = 0
    CL_DEVICE_GLOBAL_MEM_SIZE        = 68702699520
    CL_DEVICE_LOCAL_MEM_SIZE         = 65536
    CL_DEVICE_MAX_MEM_ALLOC_SIZE     = 58397294592
    CL_DEVICE_ADDRESS_BITS           = 64
    CL_DEVICE_MAX_CLOCK_FREQUENCY    = 1700
    CL_DEVICE_EXTENSIONS             = cl_amd_assembly_program 
        cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops 
        cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store 
        cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing 
        cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
        cl_khr_image2d_from_buffer cl_khr_int64_base_atomics 
        cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics 
        cl_khr_local_int32_extended_atomics cl_khr_subgroups 

  gfx90a:sramecc+:xnack-
    CL_PLATFORM_NAME                 = AMD Accelerated Parallel Processing
    CL_DEVICE_TYPE                   = 4
    CL_DEVICE_VENDOR                 = Advanced Micro Devices, Inc.
    CL_DEVICE_VERSION                = OpenCL 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS      = 110
    CL_DEVICE_HOST_UNIFIED_MEMORY    = 0
    CL_DEVICE_GLOBAL_MEM_SIZE        = 68702699520
    CL_DEVICE_LOCAL_MEM_SIZE         = 65536
    CL_DEVICE_MAX_MEM_ALLOC_SIZE     = 58397294592
    CL_DEVICE_ADDRESS_BITS           = 64
    CL_DEVICE_MAX_CLOCK_FREQUENCY    = 1700
    CL_DEVICE_EXTENSIONS             = cl_amd_assembly_program 
        cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops 
        cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store 
        cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing 
        cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
        cl_khr_image2d_from_buffer cl_khr_int64_base_atomics 
        cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics 
        cl_khr_local_int32_extended_atomics cl_khr_subgroups 

cpu-bind=MASK - g1245, task  0  0 [4127093]: mask 0xffffffff00000000ffffffff set
cpu-bind=MASK - g1245, task  1  1 [4127094]: mask 0xffffffff00000000ffffffff00000000 set
srun: error: g1245: task 1: Segmentation fault (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=116457.0
slurmstepd: error: *** STEP 116457.0 ON g1245 CANCELLED AT 2023-04-07T10:30:51 ***
srun: error: g1245: task 0: Terminated
srun: Force Terminated StepId=116457.0

@pelyakim
Author

pelyakim commented Apr 7, 2023

I don't know if this is related, but when configuring vexcl, CMake does not find OPENCL_HPP:

/vexcl_cce$ cmake -Bvexcl_build -DVEXCL_BUILD_TESTS=ON -DVEXCL_BUILD_EXAMPLES=ON
-- No build type selected, default to RelWithDebInfo
-- The C compiler identification is Clang 15.0.6
-- The CXX compiler identification is Clang 15.0.6
-- Cray Programming Environment 2.7.19 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/craype/2.7.19/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Cray Programming Environment 2.7.19 CXX
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.19/bin/CC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /opt/software/gaia/dev/1.0.3-e7de077a/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/__spack_p/boost-1.81.0-cce-15.0.1-cn4o/lib/cmake/Boost-1.81.0/BoostConfig.cmake (found version "1.81.0") found components: chrono date_time filesystem program_options system thread unit_test_framework 
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /opt/rocm-5.2.0/lib/libOpenCL.so (found version "2.2") 
--  -- OPENCL_HPP-NOTFOUND --
-- Found VexCL::OpenCL
-- Found VexCL::Compute
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) 
-- Found OpenMP_C: -fopenmp (found version "5.0") 
-- Found OpenMP_CXX: -fopenmp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- Found VexCL::JIT
-- Selected backend: OpenCL
-- Configuring done
-- Generating done
-- Build files have been written to:

And when I build the tests and examples, there is one error:

vexcl_build$ make -j
[  1%] Building CXX object tests/CMakeFiles/fft.dir/fft.cpp.o
[  4%] Building CXX object tests/CMakeFiles/context.dir/context.cpp.o
[  5%] Building CXX object tests/CMakeFiles/scan.dir/scan.cpp.o
[  5%] Building CXX object tests/CMakeFiles/vector_io.dir/vector_io.cpp.o
[  7%] Building CXX object tests/CMakeFiles/vector_pointer.dir/vector_pointer.cpp.o
[ 10%] Building CXX object tests/CMakeFiles/multi_array.dir/multi_array.cpp.o
[ 10%] Building CXX object tests/CMakeFiles/vector_view.dir/vector_view.cpp.o
[ 13%] Building CXX object tests/CMakeFiles/temporary.dir/temporary.cpp.o
[ 13%] Building CXX object tests/CMakeFiles/image.dir/image.cpp.o
[ 13%] Building CXX object tests/CMakeFiles/stencil.dir/stencil.cpp.o
[ 17%] Building CXX object tests/CMakeFiles/cast.dir/cast.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/reduce_by_key.dir/reduce_by_key.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/spmv.dir/spmv.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/tensordot.dir/tensordot.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/multivector_create.dir/multivector_create.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/sparse_matrices.dir/sparse_matrices.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/threads.dir/threads.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/multivector_arithmetics.dir/multivector_arithmetics.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/vector_arithmetics.dir/vector_arithmetics.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/events.dir/events.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/logical.dir/logical.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/reinterpret.dir/reinterpret.cpp.o
[ 28%] Building CXX object tests/CMakeFiles/vector_create.dir/vector_create.cpp.o
[ 28%] Building CXX object tests/CMakeFiles/boost_version.dir/boost_version.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/random.dir/random.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/custom_kernel.dir/custom_kernel.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/tagged_terminal.dir/tagged_terminal.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/deduce.dir/deduce.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/types.dir/types.cpp.o
[ 32%] Building CXX object tests/CMakeFiles/vector_copy.dir/vector_copy.cpp.o
[ 33%] Building CXX object tests/CMakeFiles/multiple_objects.dir/dummy1.cpp.o
[ 33%] Building CXX object tests/CMakeFiles/multiple_objects.dir/dummy2.cpp.o
[ 35%] Building CXX object tests/CMakeFiles/generator.dir/generator.cpp.o
[ 40%] Building CXX object tests/CMakeFiles/mba.dir/mba.cpp.o
[ 40%] Building CXX object tests/CMakeFiles/scan_by_key.dir/scan_by_key.cpp.o
[ 40%] Building CXX object tests/CMakeFiles/sort.dir/sort.cpp.o
[ 43%] Building CXX object tests/CMakeFiles/constants.dir/constants.cpp.o
[ 43%] Building CXX object examples/CMakeFiles/fft_benchmark.dir/fft_benchmark.cpp.o
[ 45%] Building CXX object tests/CMakeFiles/eval.dir/eval.cpp.o
[ 45%] Building CXX object examples/CMakeFiles/fft_profile.dir/fft_profile.cpp.o
[ 45%] Building CXX object examples/CMakeFiles/exclusive.dir/exclusive.cpp.o
[ 45%] Building CXX object examples/CMakeFiles/mba_benchmark.dir/mba_benchmark.cpp.o
[ 45%] Building CXX object examples/CMakeFiles/benchmark.dir/benchmark.cpp.o
[ 49%] Building CXX object examples/CMakeFiles/devlist.dir/devlist.cpp.o
[ 49%] Building CXX object examples/CMakeFiles/complex_simple.dir/complex_simple.cpp.o
[ 49%] Building CXX object examples/CMakeFiles/symbolic.dir/symbolic.cpp.o
[ 49%] Building CXX object examples/CMakeFiles/complex_spmv.dir/complex_spmv.cpp.o
[ 50%] Building CXX object tests/CMakeFiles/svm.dir/svm.cpp.o
[ 51%] Linking CXX executable boost_version
[ 51%] Built target boost_version
/lus/home/pelyakime/AMGCL/vexcl_cce/tests/vector_copy.cpp:81:40: warning: lambda capture 'n' is not required to be captured for this use [-Wunused-lambda-capture]
    std::generate(i.begin(), i.end(), [n](){ return rand() % n; });
                                       ^
/lus/home/pelyakime/AMGCL/vexcl_cce/tests/threads.cpp:13:17: warning: lambda capture 'n' is not required to be captured for this use [-Wunused-lambda-capture]
    auto run = [n](vex::backend::command_queue queue, cl_long *s) {
                ^
In file included from /lus/home/pelyakime/AMGCL/vexcl_cce/tests/vector_create.cpp:3:
In file included from /lus/home/pelyakime/AMGCL/vexcl_cce/vexcl/vector.hpp:51:
/lus/home/pelyakime/AMGCL/vexcl_cce/vexcl/operations.hpp:755:42: error: call to implicitly-deleted default constructor of 'boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>'
    vector_expression(const Expr &expr = Expr())
                                         ^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/type_traits:901:30: note: in instantiation of default function argument expression for 'vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>' required here
    : public __bool_constant<__is_constructible(_Tp, _Args...)>
                             ^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/type_traits:139:26: note: in instantiation of template class 'std::__is_constructible_impl<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>' requested here
    : public conditional<_B1::value, _B2, _B1>::type
                         ^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/type_traits:1224:14: note: in instantiation of template class 'std::__and_<std::__is_constructible_impl<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>, std::__is_implicitly_default_constructible_safe<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>>' requested here
    : public __and_<__is_constructible_impl<_Tp>,
             ^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/type_traits:139:26: note: in instantiation of template class 'std::__is_implicitly_default_constructible<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>' requested here
    : public conditional<_B1::value, _B2, _B1>::type
                         ^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:491:9: note: in instantiation of template class 'std::__and_<std::__is_implicitly_default_constructible<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>, std::__is_implicitly_default_constructible<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>>>' requested here
        return __and_<std::__is_implicitly_default_constructible<_Types>...
               ^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:899:6: note: in instantiation of member function 'std::_TupleConstraints<true, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>>::__is_implicitly_default_constructible' requested here
            __is_implicitly_default_constructible(),
            ^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:1063:66: note: in instantiation of template type alias '_ImplicitDefaultCtor' requested here
               _ImplicitDefaultCtor<is_object<_Alloc>::value, _T1, _T2> = true>
                                                                        ^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:1065:2: note: while substituting prior template arguments into non-type template parameter [with _Alloc = vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>]
        tuple(allocator_arg_t __tag, const _Alloc& __a)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:1482:14: note: while substituting deduced template arguments into function template 'tuple' [with _Alloc = vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>, $1 = (no value)]
      return __result_type(std::forward<_Elements>(__args)...);
             ^
/lus/home/pelyakime/AMGCL/vexcl_cce/tests/vector_create.cpp:208:54: note: in instantiation of function template specialization 'std::make_tuple<const vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>, const vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>>' requested here
    std::tie(q, s) = vex::expression_properties(std::make_tuple(2 * x, x - 1));
                                                     ^
/opt/software/gaia/dev/1.0.3-e7de077a/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/__spack_p/boost-1.81.0-cce-15.0.1-cn4o/include/boost/proto/detail/preprocessed/basic_expr.hpp:212:97: note: default constructor of 'basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>' is implicitly deleted because field 'child1' of reference type 'boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>::proto_child1' (aka 'vex::vector<int> &') would not be initialized
        typedef Arg0 proto_child0; proto_child0 child0; typedef Arg1 proto_child1; proto_child1 child1;
                                                                                                ^
1 error generated.
make[2]: *** [tests/CMakeFiles/vector_create.dir/build.make:76: tests/CMakeFiles/vector_create.dir/vector_create.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:918: tests/CMakeFiles/vector_create.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 52%] Linking CXX executable devlist
[ 52%] Built target devlist
[ 53%] Linking CXX executable svm
[ 53%] Built target svm
[ 54%] Linking CXX executable types
[ 54%] Built target types
[ 55%] Linking CXX executable exclusive
[ 55%] Built target exclusive
[ 56%] Linking CXX executable multiple_objects
[ 56%] Built target multiple_objects
[ 57%] Linking CXX executable custom_kernel
[ 57%] Built target custom_kernel
[ 58%] Linking CXX executable vector_io
[ 58%] Built target vector_io
[ 60%] Linking CXX executable constants
[ 60%] Built target constants
[ 61%] Linking CXX executable mba_benchmark
[ 61%] Built target mba_benchmark
[ 62%] Linking CXX executable reinterpret
[ 62%] Built target reinterpret
[ 63%] Linking CXX executable context
[ 63%] Built target context
[ 64%] Linking CXX executable complex_simple
[ 64%] Built target complex_simple
[ 65%] Linking CXX executable eval
[ 65%] Built target eval
[ 66%] Linking CXX executable cast
[ 66%] Built target cast
[ 67%] Linking CXX executable image
[ 67%] Built target image
1 warning generated.
[ 68%] Linking CXX executable vector_copy
[ 68%] Built target vector_copy
[ 69%] Linking CXX executable multivector_create
[ 69%] Built target multivector_create
1 warning generated.
[ 70%] Linking CXX executable threads
[ 70%] Built target threads
[ 71%] Linking CXX executable logical
[ 71%] Built target logical
[ 72%] Linking CXX executable symbolic
[ 72%] Built target symbolic
[ 73%] Linking CXX executable mba
[ 73%] Built target mba
[ 74%] Linking CXX executable deduce
[ 74%] Built target deduce
[ 75%] Linking CXX executable events
[ 75%] Built target events
[ 76%] Linking CXX executable scan
[ 76%] Built target scan
[ 77%] Linking CXX executable multi_array
[ 77%] Built target multi_array
[ 78%] Linking CXX executable stencil
[ 78%] Built target stencil
[ 80%] Linking CXX executable reduce_by_key
[ 80%] Built target reduce_by_key
[ 81%] Linking CXX executable complex_spmv
[ 81%] Built target complex_spmv
[ 82%] Linking CXX executable tensordot
[ 82%] Built target tensordot
[ 83%] Linking CXX executable scan_by_key
[ 83%] Built target scan_by_key
[ 84%] Linking CXX executable tagged_terminal
[ 84%] Built target tagged_terminal
[ 85%] Linking CXX executable vector_pointer
[ 85%] Built target vector_pointer
[ 86%] Linking CXX executable generator
[ 86%] Built target generator
[ 87%] Linking CXX executable temporary
[ 87%] Built target temporary
[ 88%] Linking CXX executable random
[ 88%] Built target random
[ 89%] Linking CXX executable fft_profile
[ 89%] Built target fft_profile
[ 90%] Linking CXX executable fft_benchmark
[ 90%] Built target fft_benchmark
[ 91%] Linking CXX executable sparse_matrices
[ 91%] Built target sparse_matrices
[ 92%] Linking CXX executable vector_view
[ 92%] Built target vector_view
[ 93%] Linking CXX executable spmv
[ 93%] Built target spmv
[ 94%] Linking CXX executable multivector_arithmetics
[ 94%] Built target multivector_arithmetics
[ 95%] Linking CXX executable vector_arithmetics
[ 95%] Built target vector_arithmetics
[ 96%] Linking CXX executable fft
[ 96%] Built target fft
[ 97%] Linking CXX executable sort
[ 97%] Built target sort
[ 98%] Linking CXX executable benchmark
[ 98%] Built target benchmark
make: *** [Makefile:146: all] Error 2

@ddemidov
Owner

ddemidov commented Apr 7, 2023

Looks like you do have some AMD GPUs. Try to replace these lines

https://github.com/ddemidov/amgcl/blob/276a6492f69e8c70a7e45baa32db500838952352/tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp#L37-L42

with

std::cout << world.rank << ": " << ctx << std::endl;

@pelyakim
Author

pelyakim commented Apr 7, 2023

After replacing these lines, I get a new error:

cpu-bind=MASK - g1245, task  0  0 [1121470]: mask 0xffffffff00000000ffffffff set
cpu-bind=MASK - g1245, task  1  1 [1121472]: mask 0xffffffff00000000ffffffff00000000 set
terminate called after throwing an instance of 'std::runtime_error'
  what():  Empty VexCL context!
srun: error: g1245: task 1: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=116528.0
slurmstepd: error: *** STEP 116528.0 ON g1245 CANCELLED AT 2023-04-07T11:42:43 ***
srun: error: g1245: task 0: Terminated
srun: Force Terminated StepId=116528.0

And my output file is:

1: 
0: 1. gfx90a:sramecc+:xnack- (AMD Accelerated Parallel Processing)

World size: 2
Matrix poisson3Db.bin: 85623x85623
RHS poisson3Db_b.bin: 85623x1

@ddemidov
Owner

ddemidov commented Apr 7, 2023

So only one of your MPI processes got a GPU (do you only have one?). The context is created in Exclusive mode here:

https://github.com/ddemidov/amgcl/blob/276a6492f69e8c70a7e45baa32db500838952352/tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp#L36

you can replace it with

vex::Context ctx(vex::Filter::Count(1));

but then all of your MPI processes will use the same GPU, which would not be efficient (but should work). In general, it is better to start a single MPI process per GPU.
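When several MPI processes do have to share GPUs, one common approach is to map each node-local rank to a device index round-robin. The following is only a minimal sketch: `device_for_rank` is a hypothetical helper (it is not part of VexCL or amgcl), and how the node-local rank and the device count are obtained (for example from the `SLURM_LOCALID` environment variable under Slurm) is an assumption left to the caller.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a VexCL or amgcl function): choose a device
// index for an MPI process so that the processes running on one node are
// spread round-robin over the GPUs visible on that node.
inline std::size_t device_for_rank(std::size_t local_rank, std::size_t num_devices) {
    assert(num_devices > 0 && "at least one GPU must be visible");
    return local_rank % num_devices;
}
```

With a single visible GPU every rank maps to device 0, which corresponds to the shared-GPU situation described above; with four GPUs and eight ranks per node, each GPU would serve two ranks.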

@pelyakim
Author

pelyakim commented Apr 7, 2023

OK, thanks a lot for your explanations, it works now. Indeed, I had reserved only 1 GPU for the test in my Slurm script; I can reserve more, of course. For the simulation that I want to launch later, if everything works well, I am thinking of using several MPI processes per GPU. My code is hybrid MPI-GPU: only the pseudo-Poisson equation for the pressure in the Navier-Stokes equations is solved on the GPUs. This reduces the MPI part of the code while keeping the efficiency of the GPUs. Now I have to test with the matrix and the right-hand side that my code builds. Thanks for your help. I will keep you informed of my progress. Have a nice day.

@pelyakim
Author

Hello,
I tested the poisson3Db_mpi_vexcl example with a matrix and right-hand side taken from one of my test cases, and I obtained a solution that seems correct.
I would now like to integrate this solution of the linear system Ax=b into my JADIM code (IMFT, France); however, I have some difficulties.
The partitioning is already done in the code (I have a partitioning matrix that defines the rank of the owning process for each cell of the mesh) and the matrix is in CSR format. Each MPI process holds its own part of the partitioned CSR matrix. Example: for an MPI partitioning of 2 in X, 1 in Y and 1 in Z, rank 0 owns the first half of the matrix and rank 1 the other half.
So I started from the poisson3Db_mpi_vexcl example and replaced the reading of the CSR matrix and the right-hand side with the parts of the matrix A (in fact, pointers to the arrays row_offset, ia and val) and of the right-hand side that each MPI process holds. I copied all these arrays into vector<>. I checked that the values match those from my first test with the poisson3Db_mpi_vexcl example using the same (but global) matrix, and my input vectors look good.
However, when I get to the line Solver solve(world, A, prm, bprm); I get an "out of range" error.
I'm not sure where this could come from; could you give me an idea?

Also, a second question: could I easily use my own partitioning without having to recode the MPI distribution of the matrix?

Thank you very much for your help, which is very valuable to me.

Here is the function where I integrate the solution of Ax=b into my code:

#include <vector>
#include <iostream>
#include <ctime>

#include <amgcl/backend/vexcl.hpp>
#include <amgcl/adapter/crs_tuple.hpp>

#include <amgcl/mpi/distributed_matrix.hpp>
#include <amgcl/mpi/make_solver.hpp>
#include <amgcl/mpi/amg.hpp>
#include <amgcl/mpi/coarsening/smoothed_aggregation.hpp>
#include <amgcl/mpi/relaxation/spai0.hpp>
#include <amgcl/mpi/solver/bicgstab.hpp>

#include <amgcl/io/binary.hpp>
#include <amgcl/profiler.hpp>

// #if defined(AMGCL_HAVE_PARMETIS)
// #  include <amgcl/mpi/partition/parmetis.hpp>
// #elif defined(AMGCL_HAVE_SCOTCH)
#include <amgcl/mpi/partition/ptscotch.hpp>
// #endif


using namespace std;



extern "C" {

  void AMGCL_cg_amg_mpi(  double *matval, int *ia, int *ja,
                double *rhs_jadim, double *sol, int &nip,
                int &njp, int &nkp, int &nnz, int &npt, int &npt0, int &irovar, int &nloc, int &t_p, int &maxit )
/*  void AMGCL_cg_amg_mpi(  double *matval, int *ia, int *ja,
                double *rhs_jadim, double *sol, int &nip,
                int &njp, int &nkp, int &nnz, int &npt, int &npt0, int &irovar, int &nloc, int &t_p, MPI_Comm *comm_c_AMGCL, int &maxit ) */ //double norm )
  {
    FILE *f1, *f2, *f3, *f4;
    int nijkp = nip*njp*nkp;
    int num_procs;
    clock_t c_start, c_end;

    cout << "Check input params: " << nip << ", " <<njp << ", " << nkp << ", " << nnz << ", " << npt << ", " << npt0 << ", " << irovar << ", " << nloc << " , "<< t_p << ", " << maxit << ", " << endl;// num_procs << endl;

//     MPI_Comm_size(*comm_c_AMGCL, &num_procs);
    amgcl::mpi::communicator world(MPI_COMM_WORLD);


    // Wait for all processes
     MPI_Barrier(world);

    // Create VexCL context. Use vex::Filter::Exclusive so that different MPI
    // processes get different GPUs. Each process gets a single GPU:
    vex::Context ctx(vex::Filter::Exclusive(vex::Filter::Count(1)));
    std::cout << world.rank << ": " << ctx << std::endl;

     // The profiler:
    amgcl::profiler<> prof("JADIM MPI(VexCL)");

    // Read the system matrix and the RHS:
    prof.tic("read");

    // Get the global size of the matrix:
    ptrdiff_t rows_global = nijkp; //amgcl::io::crs_size<ptrdiff_t>(argv[1]);
    ptrdiff_t cols = 1;
    ptrdiff_t rows = nloc;
    ptrdiff_t chunk = nloc;

    cout << world.rank << " - rows_global :" << rows_global << "rows : " << rows <<  " cols :" << cols << " chunk: " << chunk << endl;

//     // Split the matrix into approximately equal chunks of rows_global
//     ptrdiff_t chunk = (rows_global + world.size - 1) / world.size;
//     ptrdiff_t row_beg = std::min(rows_global, chunk * t_p);
//     ptrdiff_t row_end = std::min(rows_global, row_beg + chunk);
//     chunk = row_end - row_beg;
//
//     cout << world.rank << ": chunk : " << chunk << " row_beg: " << row_beg << " row_end: " << row_end << endl;


//     amgcl::io::read_crs(argv[1], rows, row_offset, col, val, row_beg, row_end);
//     amgcl::io::read_dense(argv[2], rows, cols, rhs, row_beg, row_end);

    // ---------- 1 - Copy matval, ia, ja, rhs_jadim and sol into temporary buffers ---------

    c_start = clock();

    // Read our part of the system matrix and the RHS.
    vector<ptrdiff_t> row_offset(nloc+1), col(nnz);
    vector<double> val(nnz), rhs(nloc), in_x(nloc);

    // Copy ia, ja, matval, rhs_jadim and sol into temporary arrays
//     cout << "Copying ia, ja and matval" << endl;

    for (int i=0; i<nloc+1; ++i) {
      row_offset[i]=ia[i];
//       if (t_p == 0 ) cout <<  row_offset[i] <<  "  " << ia[i] << endl;
    }

    for (int i=0; i<nnz; ++i) {
      col[i] = ja[i];
      val[i] = matval[i];
//       if (t_p == 0 ) cout << col[i] << " " << val[i] << endl;
    }
//     cout << "Copying rhs_jadim and sol" << endl;
    for (int i=0; i<nloc; ++i) {
      rhs[i]  = rhs_jadim[i];
      in_x[i] = sol[i];
//       if (t_p == 0) cout << rhs[i] << " " << in_x[i] << endl;
    }

    // Stop time measurement
    if (t_p == 0) cout << "Time to copy buffer : " << double(clock() - c_start) / CLOCKS_PER_SEC << endl;

    prof.toc("read");


    // Copy the RHS vector to the backend:
    vex::vector<double> f(ctx, rhs);

    if (t_p == 0)
        std::cout
            << "World size: " << world.size << std::endl
            << "Matrix " << ": " << rows << "x" << rows << std::endl
            << "RHS "    << ": " << rows << "x" << cols << std::endl;

    // Compose the solver type
    typedef amgcl::backend::vexcl<double> DBackend;
    typedef amgcl::backend::vexcl<float>  FBackend;
    typedef amgcl::mpi::make_solver<
        amgcl::mpi::amg<
            FBackend,
            amgcl::mpi::coarsening::smoothed_aggregation<FBackend>,
            amgcl::mpi::relaxation::spai0<FBackend>
            >,
        amgcl::mpi::solver::bicgstab<DBackend>
        > Solver;

    cout << world.rank << " - Before make_shared" << endl;
    // Create the distributed matrix from the local parts.
    auto A = std::make_shared<amgcl::mpi::distributed_matrix<DBackend>>(
            world, std::tie(chunk, row_offset, col, val));
//     auto A = std::make_shared<amgcl::mpi::distributed_matrix<DBackend>>(
//             *comm_c_AMGCL, std::tie(chunk, row_offset, col, val));
    cout << world.rank << " - After make_shared" << endl;

    // Wait for all processes
     MPI_Barrier(world);

    typedef amgcl::mpi::partition::ptscotch<DBackend> Partition;

    if (world.size > 1) {
        prof.tic("partition");
        Partition part;

        // part(A) returns the distributed permutation matrix:
        auto P = part(*A);
        auto R = transpose(*P);

        // Reorder the matrix:
        A = product(*R, *product(*A, *P));

        // and the RHS vector:
        vex::vector<double> new_rhs(ctx, R->loc_rows());
        R->move_to_backend(typename DBackend::params());
        amgcl::backend::spmv(1, *R, f, 0, new_rhs);
        f.swap(new_rhs);

        // Update the number of the local rows
        // (it may have changed as a result of permutation):
        chunk = A->loc_rows();
        prof.toc("partition");
    }

    // Wait for all processes
    MPI_Barrier(world);

    cout << world.rank << " - After partition" << endl;

    // Initialize the solver:
    Solver::params prm;
    DBackend::params bprm;
    bprm.q = ctx;

    prof.tic("setup");
//     Solver solve(*comm_c_AMGCL, A, prm, bprm);
    Solver solve(world, A, prm, bprm);
    prof.toc("setup");

    cout << world.rank << " - After solve" << endl;

    // Show the mini-report on the constructed solver:
    if (t_p == 0)
        std::cout << solve << std::endl;

    // Solve the system with the zero initial approximation:
    int iters;
    double error;
    vex::vector<double> x(ctx, chunk);
    x = 0.0;

    prof.tic("solve");
    std::tie(iters, error) = solve(*A, f, x);
    prof.toc("solve");

    // Output the number of iterations, the relative error,
    // and the profiling data:
    if (t_p == 0)
        std::cout
            << "Iters: " << iters << std::endl
            << "Error: " << error << std::endl
            << prof << std::endl;

  }
}

@pelyakim
Author

This is my error output


The following have been reloaded with a version change:
  1) cray-libsci/22.11.1.2 => cray-libsci/23.02.1.1
  2) cray-mpich/8.1.21 => cray-mpich/8.1.24
  3) perftools-base/22.09.0 => perftools-base/23.02.0
  4) rocm/5.2.3 => rocm/5.2.0


Currently Loaded Modules:
  1) craype-network-ofi        9) cray-mpich/8.1.24
  2) craype-x86-trento        10) craype/2.7.19
  3) craype-accel-amd-gfx90a  11) perftools-base/23.02.0
  4) libfabric/1.15.2.0       12) rocm/5.2.0
  5) PrgEnv-cray/8.3.3        13) cpe/23.02
  6) cce/15.0.1               14) CPE-23.02-cce-15.0.1-GPU-softs
  7) cray-dsmml/0.2.2         15) scotch/6.1.3-mpi
  8) cray-libsci/23.02.1.1    16) boost/1.81.0-mpi-python3

 

cpu-bind=MASK - g1235, task  0  0 [3713689]: mask 0xffffffff00000000ffffffff set
cpu-bind=MASK - g1235, task  1  1 [3713690]: mask 0xffffffff00000000ffffffff00000000 set
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
srun: error: g1235: task 1: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=176864.0
slurmstepd: error: *** STEP 176864.0 ON g1235 CANCELLED AT 2023-04-13T16:44:18 ***
srun: error: g1235: task 0: Terminated
srun: Force Terminated StepId=176864.0
slurm-176864.out (END)
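
For readers decoding the log above: `_Map_base::at` is the message that the libstdc++ shipped with GCC 10 attaches to the `std::out_of_range` thrown by `std::unordered_map::at` when the key is missing. Which lookup inside the solver setup failed is not shown in the log, so the sketch below only reproduces the exception type in isolation; `lookup_throws` and the example map are hypothetical and not taken from amgcl.

```cpp
#include <stdexcept>
#include <unordered_map>

// Returns true if looking up `key` with at() throws std::out_of_range,
// i.e. if `key` is not present in the map.
inline bool lookup_throws(const std::unordered_map<long, long> &m, long key) {
    try {
        (void)m.at(key);   // at() throws for a missing key
        return false;
    } catch (const std::out_of_range &) {
        return true;
    }
}
```

A map from global column indices to local positions that is queried with an index nobody registered would fail in exactly this way, which is one plausible (but unconfirmed) reading of the error.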

@ddemidov
Owner

Try to read this page: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3DbMPI.html

There I tried to explain what amgcl expects from the partitioned matrix. In short, each MPI process should contain consecutive row-wise chunks of the matrix, and the columns should have global numbering.
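
The requirement above can be met from an existing per-cell ownership array by renumbering the unknowns so that every rank owns one consecutive range of global rows. The following is a minimal sketch under stated assumptions, not amgcl API: `owner`, `Renumbering`, `renumber_by_owner` and `renumber_columns` are hypothetical names, `owner[c]` is assumed to hold the MPI rank of cell `c` in the original global numbering, and each rank is assumed to store its rows in the same relative order as that original numbering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// perm[c]    : new global index of the cell with old global index c.
// row_beg[r] : first new global row owned by rank r; rank r owns the
//              consecutive range [row_beg[r], row_beg[r + 1]).
struct Renumbering {
    std::vector<std::ptrdiff_t> perm;
    std::vector<std::ptrdiff_t> row_beg;
};

// Stable counting sort of the cells by owning rank.
inline Renumbering renumber_by_owner(const std::vector<int> &owner, int nranks) {
    Renumbering r;
    r.row_beg.assign(nranks + 1, 0);

    // Count the cells owned by each rank.
    for (int o : owner) {
        assert(o >= 0 && o < nranks);
        ++r.row_beg[o + 1];
    }

    // Prefix sum turns the counts into the start of each rank's range.
    for (int p = 0; p < nranks; ++p) r.row_beg[p + 1] += r.row_beg[p];

    // Hand out new indices in the original order within each rank.
    std::vector<std::ptrdiff_t> next(r.row_beg.begin(), r.row_beg.end() - 1);
    r.perm.resize(owner.size());
    for (std::size_t c = 0; c < owner.size(); ++c)
        r.perm[c] = next[owner[c]]++;

    return r;
}

// Rewrite the column indices of a local CSR chunk from the old global
// numbering to the new global numbering.
inline void renumber_columns(std::vector<std::ptrdiff_t> &col,
                             const std::vector<std::ptrdiff_t> &perm) {
    for (auto &c : col) {
        assert(c >= 0 && static_cast<std::size_t>(c) < perm.size());
        c = perm[c];
    }
}
```

After such a renumbering, the local row count of rank `r` is `row_beg[r + 1] - row_beg[r]`, and the column array of each local chunk refers to the new global indices. The solution vector then comes back in the new ordering and has to be mapped back through `perm` on the caller's side.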
