Unable to run example AOMP program on V520 #193

Open
drajarshi opened this issue Mar 9, 2021 · 8 comments
Labels
enhancement (New feature or request), gfx10

Comments

@drajarshi

drajarshi commented Mar 9, 2021

I am trying to run an OpenMP program on an instance with an AMD EPYC 7R32 CPU and a V520 GPU. This is an AWS shared instance.

I installed AOMP 11.12.0 and the ROCm dependencies.

However, when I try to compile and run the veccopy example under the AOMP install folder, I get the following error:

[ec2-user@ip-172-31-42-182 veccopy]$ sudo make run
Makefile:28: AOMP not found at /root/rocm/aomp
/usr/lib/aomp/bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx900 veccopy.c -o veccopy
./veccopy
[/root/git/aomp11/amd-llvm-project/openmp/libomptarget/plugins/amdgpu/impl/system.cpp:515] Initializing the hsa runtime failed: HSA_STATUS_ERROR_OUT_OF_RESOURCES
make: *** [run] Error 1

I am unable to figure out the meaning of the above error and how to fix it.

Then I modified the Makefile to set the GPU to gfx1011, the device type for the V520 (see the last AOMP_GPU line below):

[ec2-user@ip-172-31-42-182 veccopy]$ grep AOMP_GPU Makefile
.......................
INSTALLED_GPU = $(shell $(AOMP)/bin/mygpu -d gfx900)# Default AOMP_GPU is gfx900 which is vega
AOMP_GPU ?= $(INSTALLED_GPU)
AOMP_GPU = gfx1011 # for the V520 device

......................

......................

[ec2-user@ip-172-31-42-182 veccopy]$ sudo make run
Makefile:28: AOMP not found at /root/rocm/aomp
/usr/lib/aomp/bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx1011 veccopy.c -o veccopy
clang-11: error: no such file or directory: 'libomptarget-amdgcn-gfx1011.bc'
clang-11: error: no such file or directory: 'libaompextras-amdgcn-gfx1011.bc'
make: *** [veccopy] Error 1

The bitcode file for gfx1011 is not available in the rocm install folder.

[ec2-user@ip-172-31-42-182 veccopy]$ find / -name libomptarget-amdgcn* 2>/dev/null
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx700.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx701.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx801.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx803.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx900.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx902.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx906.bc
/usr/lib/aomp_11.12-0/lib/libdevice/libomptarget-amdgcn-gfx908.bc

The same list above shows under /opt/rocm-4.0.0/llvm/lib/ as well.
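
A quick way to confirm which device bitcode files are missing for a given architecture is sketched below. This is only an illustration: the function name `missing_device_bc` is made up for this example, the three library prefixes are taken from the clang errors quoted in this thread, and it is an assumption that all of them sit in one directory.

```shell
# Print the device bitcode files that are absent for one GPU arch.
# $1 = libdevice directory, $2 = arch (for example gfx1011).
# The prefixes come from the clang error messages in this thread;
# that they all live in the same directory is an assumption.
missing_device_bc() {
    dir=$1
    arch=$2
    for prefix in libomptarget libaompextras libm; do
        file="$dir/$prefix-amdgcn-$arch.bc"
        # Report only the files that do not exist.
        [ -f "$file" ] || echo "$file"
    done
}

# Example call (path taken from the find output above):
#   missing_device_bc /usr/lib/aomp_11.12-0/lib/libdevice gfx1011
```

An empty result would mean every expected file is present for that architecture; any printed path is a file clang would fail to find.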

Here's my rocm install list:

[ec2-user@ip-172-31-42-182 veccopy]$ rpm -qa | grep rocm
rocm-dbgapi-0.42.0.40000-23.el7.x86_64
rocm-opencl-devel-3.6Beta_17_g875c1f8_rocm_rel_4.0_23-1.x86_64
rocm-device-libs-1.0.0.637_rocm_rel_4.0_23_db8c0c3-1.x86_64
rocm-gdb-10.1_rocm_rel_4.0_23-1.x86_64
hsa-rocr-dev-1.2.40000.0_rocm_rel_4.0_23_a5173c90-1.x86_64
rocminfo-1.40000.0-1.x86_64
rocm-opencl-3.6Beta_17_g875c1f8_rocm_rel_4.0_23-1.x86_64
rocm-clang-ocl-0.5.0.64_rocm_rel_4.0_23_50fb51a-1.x86_64
rocm-smi-lib64-2.9.0.9_rocm_rel_4.0_23_4b49d2d-1.x86_64
rocm-cmake-0.3.0.153_rocm_rel_4.0_23_1d1caa5-1.x86_64
rocm-dkms-4.0.0.40000-23.el7.x86_64
comgr-1.9.0.194_rocm_rel_4.0_23_0fa438b-1.x86_64
rocm-utils-4.0.0.40000-23.el7.x86_64
rocm-smi-3.8.0-1.el7.noarch
rocm-dev-4.0.0.40000-23.el7.x86_64

Please suggest how to get the OpenMP examples to run successfully on the V520 GPU.

Thanks in advance.

Regards,

Rajarshi Das

@drajarshi
Author

I subsequently thought it might be due to both AOMP 11.12.0 (based on ROCm 3.10) and ROCm 4.0.0 being installed.
Hence, I did a fresh install of ROCm 4.0.0 on a separate identical AWS instance.

In the /opt/rocm/llvm/examples/veccopy/ folder, I modified the Makefile with the following variable settings:
AOMP_GPU=gfx900
OFFLOAD_DEBUG=1

Subsequently, I see the following output:
$ sudo make run
Makefile:28: AOMP not found at /root/rocm/aomp
DEBUG Mode ON
LIBOMPTARGET_DEBUG=1 ./veccopy
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library '/opt/rocm/llvm/lib-debug/libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library '/opt/rocm/llvm/lib-debug/libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library '/opt/rocm/llvm/lib-debug/libomptarget.rtl.hsa.so'...
Target HSA RTL --> Start initializing HSA-ATMI
Target HSA RTL --> There are 1 devices supporting HSA.
Target HSA RTL --> Device 0: Initial groupsPerDevice 128 & threadsPerGroup 256
Libomptarget --> Successfully loaded library '/opt/rocm/llvm/lib-debug/libomptarget.rtl.hsa.so'!
Libomptarget --> Registering RTL libomptarget.rtl.hsa.so supporting 1 devices!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000400ec0 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000400ec0 is compatible with RTL libomptarget.rtl.hsa.so!
Libomptarget --> RTL 0x00000000015809b0 has index 0!
Libomptarget --> Registering image 0x0000000000400ec0 with RTL libomptarget.rtl.hsa.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Entering target region with entry point 0x0000000000400e50 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target HSA RTL --> Init requires flags to 1
Target HSA RTL --> Initialize the device id: 0
Target HSA RTL --> Using 36 compute unis per grid
Target HSA RTL --> Using 1024 ROCm blocks per grid
Target HSA RTL --> Capped thread limit: 1024
Target HSA RTL --> Queried wavefront size: 32
Target HSA RTL --> Default number of teams set according to library's default 128
Target HSA RTL --> Default number of threads set according to library's default 256
Target HSA RTL --> Device 0: default limit for groupsPerDevice 1024 & threadsPerGroup 1024
Target HSA RTL --> Device 0: wavefront size 32, total threads 1024 x 1024 = 1048576
Libomptarget --> Device 0 is ready to use.
Target HSA RTL --> "Module registering" failed
Possible gpu arch mismatch: gfx1011, please check compiler: -march= flag
Libomptarget --> Unable to generate entries table for device id 0.
Libomptarget --> Failed to init globals on device 0
Libomptarget --> Failed to get device 0 ready
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
make: *** [run] Aborted

What does the message 'Target HSA RTL --> "Module registering" failed' refer to? The next line indicates a possible GPU arch mismatch, since the GPU ID of the V520 is gfx1011 while I built the code for gfx900.
The mygpu program (/opt/rocm/bin/mygpu) returns unknown. This is because the gputable.txt in the bin/ folder does not have an entry for gfx1011.
So, if I set the variable AOMP_GPU=gfx1011 in the Makefile,
the build step fails:
clang-12: error: no such file or directory: 'libomptarget-amdgcn-gfx1011.bc'
clang-12: error: no such file or directory: 'libaompextras-amdgcn-gfx1011.bc'
clang-12: error: no such file or directory: 'libm-amdgcn-gfx1011.bc'
make: *** [veccopy] Error 1
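
Since mygpu depends on gputable.txt, the ISA name can also be cross-checked from rocminfo, which is part of the ROCm install listed earlier. The sketch below is a hedged example: `extract_gfx` is a hypothetical helper name, and it assumes rocminfo reports the GPU agent name as a `gfx...` string somewhere in its output.

```shell
# Read rocminfo-style text on stdin and print each distinct gfx ISA name once.
extract_gfx() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Example call (requires rocminfo on PATH and access to the GPU):
#   rocminfo | extract_gfx
```

The printed name is the value to compare against the `-march=` flag in the compile line.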

Is it possible to generate a gfx1011.bc from an existing .bc, such as gfx900.bc, in order to get the veccopy example to build and run?

Thanks.

@JonChesterfield
Contributor

OpenMP does not yet support gfx10. You could create the corresponding gfx1011.bc file by adding the number to the devicertl cmake file, but the end result will not work correctly. I'll ping the team with this, see if we can raise the priority of gfx10 implementation.

@drajarshi
Author

Thanks @JonChesterfield for your comments.
I didn't quite follow your suggestion about modifying the devicertl cmake file. So, I tried the approach below:
I copied libomptarget-amdgcn-gfx900.bc, converted it to .ll, and replaced the gfx900 string with gfx1011 in the attributes. I saw features such as +gfx9-insts but did not add +gfx10-insts since I wasn't sure about it. I then set the ModuleID to gfx1011 as well and assembled the file again with:
$ llvm-as <.ll>
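
The whole sequence can be sketched roughly as follows. This is a minimal sketch, not a supported procedure: the helper names `retarget_ll` and `retarget_bc` are hypothetical, the sketch assumes `llvm-dis` and `llvm-as` from the same LLVM build are on PATH, and it does a blanket string replacement rather than a careful edit of individual attributes. It leaves feature strings such as +gfx9-insts untouched, as in the manual steps.

```shell
# Rewrite every occurrence of one arch name in textual LLVM IR (stdin to stdout).
# $1 = arch name to replace, $2 = new arch name.
retarget_ll() {
    sed "s/$1/$2/g"
}

# Disassemble one bitcode file, rewrite the arch name, and reassemble it.
# $1 = input .bc, $2 = output .bc, $3 = old arch, $4 = new arch.
# Assumes llvm-dis and llvm-as are on PATH (an assumption of this sketch).
retarget_bc() {
    llvm-dis "$1" -o - | retarget_ll "$3" "$4" | llvm-as - -o "$2"
}

# Example call:
#   retarget_bc libomptarget-amdgcn-gfx900.bc libomptarget-amdgcn-gfx1011.bc gfx900 gfx1011
```

A file produced this way still contains code generated for gfx900, so it only gets the example past the missing-file error.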
I then placed the .bc in the respective folders. This time around, the 'sudo make run' for the veccopy example completed, and I saw the following output:
[ec2-user@ip-172-31-42-182 veccopy]$ sudo make
Makefile:28: AOMP not found at /root/rocm/aomp
DEBUG Mode ON
env LIBRARY_PATH=/usr/lib/aomp/lib-debug /usr/lib/aomp/bin/clang -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx1011 veccopy.c -o veccopy
[ec2-user@ip-172-31-42-182 veccopy]$ sudo make run
Makefile:28: AOMP not found at /root/rocm/aomp
DEBUG Mode ON
LIBOMPTARGET_DEBUG=1 ./veccopy
Libomptarget --> Init target library!
ompt_pre_init(): tool_setting = 1
ompt_pre_init(): ompt_enabled = 0
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library '/usr/lib/aomp/lib-debug/libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library '/usr/lib/aomp/lib-debug/libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library '/usr/lib/aomp/lib-debug/libomptarget.rtl.amdgpu.so'...
Target AMDGPU RTL --> Start initializing HSA-ATMI
Target AMDGPU RTL --> There are 1 devices supporting HSA.
Target AMDGPU RTL --> Device 0: Initial groupsPerDevice 128 & threadsPerGroup 256
Libomptarget --> Successfully loaded library '/usr/lib/aomp/lib-debug/libomptarget.rtl.amdgpu.so'!
Libomptarget --> Registering RTL libomptarget.rtl.amdgpu.so supporting 1 devices!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000000000400ee0 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000000000400ee0 is compatible with RTL libomptarget.rtl.amdgpu.so!
Libomptarget --> RTL 0x00000000016f2840 has index 0!
Libomptarget --> Registering image 0x0000000000400ee0 with RTL libomptarget.rtl.amdgpu.so!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Entering target region with entry point 0x0000000000400e70 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target AMDGPU RTL --> Init requires flags to 1
Target AMDGPU RTL --> Initialize the device id: 0
Target AMDGPU RTL --> Using 36 compute unis per grid
Target AMDGPU RTL --> Using 1024 ROCm blocks per grid
Target AMDGPU RTL --> Capped thread limit: 1024
Target AMDGPU RTL --> Queried wavefront size: 32
Target AMDGPU RTL --> Default number of teams = 1 * number of compute units 36
Target AMDGPU RTL --> Default number of threads set according to library's default 256
Target AMDGPU RTL --> Device 0: default limit for groupsPerDevice 1024 & threadsPerGroup 1024
Target AMDGPU RTL --> Device 0: wavefront size 32, total threads 1024 x 1024 = 1048576
Libomptarget --> Device 0 is ready to use.
Target AMDGPU RTL --> Setting global device environment 12 bytes
Target AMDGPU RTL --> "Module registering" succeeded
Target AMDGPU RTL --> ATMI module successfully loaded!
Target AMDGPU RTL --> to find the kernel name: __omp_offloading_10302_140b27f_main_l18 size: 39
Target AMDGPU RTL --> KernDescVal size 8 does not match advertized size 7 for '__omp_offloading_10302_140b27f_main_l18_kern_desc'
Target AMDGPU RTL --> After loading global for __omp_offloading_10302_140b27f_main_l18_kern_desc KernDesc
Target AMDGPU RTL --> KernDesc: Version: 2
Target AMDGPU RTL --> KernDesc: TSize: 7
Target AMDGPU RTL --> KernDesc: WG_Size: 0
Target AMDGPU RTL --> KernDesc: Mode: 0
Target AMDGPU RTL --> ExecModeVal 0
Target AMDGPU RTL --> Setting KernDescVal.WG_Size to default 256
Target AMDGPU RTL --> WGSizeVal 256
Target AMDGPU RTL --> "Loading KernDesc computation property" succeeded
Target AMDGPU RTL --> Construct kernelinfo: ExecMode 0
Target AMDGPU RTL --> Entry point 0 maps to __omp_offloading_10302_140b27f_main_l18
Libomptarget --> Entry 0: Base=0x00000000000186a0, Begin=0x00000000000186a0, Size=4, Type=0x320
Libomptarget --> Entry 1: Base=0x00000000000186a0, Begin=0x00000000000186a0, Size=8, Type=0x320
Libomptarget --> Entry 2: Base=0x00007ffe39ce7900, Begin=0x00007ffe39ce7900, Size=400000, Type=0x22
Libomptarget --> Entry 3: Base=0x00000000000186a0, Begin=0x00000000000186a0, Size=8, Type=0x320
Libomptarget --> Entry 4: Base=0x00007ffe39c85e80, Begin=0x00007ffe39c85e80, Size=400000, Type=0x21
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39ce7900, Size=400000)...
Target AMDGPU RTL --> Tgt alloc data 400000 bytes, (tgt:00007f41fd406000).
Libomptarget --> Creating new map entry: HstBase=0x00007ffe39ce7900, HstBegin=0x00007ffe39ce7900, HstEnd=0x00007ffe39d49380, TgtBegin=0x00007f41fd406000
Libomptarget --> There are 400000 bytes allocated at target address 0x00007f41fd406000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39c85e80, Size=400000)...
Target AMDGPU RTL --> Tgt alloc data 400000 bytes, (tgt:00007f41fd468000).
Libomptarget --> Creating new map entry: HstBase=0x00007ffe39c85e80, HstBegin=0x00007ffe39c85e80, HstEnd=0x00007ffe39ce7900, TgtBegin=0x00007f41fd468000
Libomptarget --> There are 400000 bytes allocated at target address 0x00007f41fd468000 - is new
Libomptarget --> Moving 400000 bytes (hst:0x00007ffe39c85e80) -> (tgt:0x00007f41fd468000)
Target AMDGPU RTL --> Submit data 400000 bytes, (hst:00007ffe39c85e80) -> (tgt:00007f41fd468000).
Libomptarget --> Forwarding first-private value 0x00000000000186a0 to the target construct
Libomptarget --> Forwarding first-private value 0x00000000000186a0 to the target construct
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39ce7900, Size=400000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffe39ce7900, TgtPtrBegin=0x00007f41fd406000, Size=400000, RefCount=1
Libomptarget --> Obtained target argument 0x00007f41fd406000 from host pointer 0x00007ffe39ce7900
Libomptarget --> Forwarding first-private value 0x00000000000186a0 to the target construct
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39c85e80, Size=400000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffe39c85e80, TgtPtrBegin=0x00007f41fd468000, Size=400000, RefCount=1
Libomptarget --> Obtained target argument 0x00007f41fd468000 from host pointer 0x00007ffe39c85e80
Libomptarget --> Launching target execution __omp_offloading_10302_140b27f_main_l18 with pointer 0x0000000001730c80 (index=0).
Target AMDGPU RTL --> Run target team region thread_limit 0
Target AMDGPU RTL --> Arg_num: 5
Target AMDGPU RTL --> Offseted base: arg[0]:0x00000000000186a0
Target AMDGPU RTL --> Offseted base: arg[1]:0x00000000000186a0
Target AMDGPU RTL --> Offseted base: arg[2]:0x00007f41fd406000
Target AMDGPU RTL --> Offseted base: arg[3]:0x00000000000186a0
Target AMDGPU RTL --> Offseted base: arg[4]:0x00007f41fd468000
Target AMDGPU RTL --> Preparing 256 threads
Target AMDGPU RTL --> Set default num of groups 36
Target AMDGPU RTL --> Final 1 num_groups and 256 threadsPerGroup
Target AMDGPU RTL --> Kernel completed
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39c85e80, Size=400000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffe39c85e80, TgtPtrBegin=0x00007f41fd468000, Size=400000, updated RefCount=1
Libomptarget --> There are 400000 bytes allocated at target address 0x00007f41fd468000 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39c85e80, Size=400000)...
Libomptarget --> Deleting tgt data 0x00007f41fd468000 of size 400000
Target AMDGPU RTL --> Tgt free data (tgt:00007f41fd468000).
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffe39c85e80, TgtPtrBegin=0x00007f41fd468000, Size=400000
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39ce7900, Size=400000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffe39ce7900, TgtPtrBegin=0x00007f41fd406000, Size=400000, updated RefCount=1
Libomptarget --> There are 400000 bytes allocated at target address 0x00007f41fd406000 - is last
Libomptarget --> Moving 400000 bytes (tgt:0x00007f41fd406000) -> (hst:0x00007ffe39ce7900)
Target AMDGPU RTL --> Retrieve data 400000 bytes, (tgt:00007f41fd406000) -> (hst:00007ffe39ce7900).
Target AMDGPU RTL --> DONE Retrieve data 400000 bytes, (tgt:00007f41fd406000) -> (hst:00007ffe39ce7900).
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffe39ce7900, Size=400000)...
Libomptarget --> Deleting tgt data 0x00007f41fd406000 of size 400000
Target AMDGPU RTL --> Tgt free data (tgt:00007f41fd406000).
Libomptarget --> Removing mapping with HstPtrBegin=0x00007ffe39ce7900, TgtPtrBegin=0x00007f41fd406000, Size=400000
Success
Target AMDGPU RTL --> Finalizing the HSA-ATMI DeviceInfo.
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000000000400ee0 is compatible with RTL 0x00000000016f2840!
Libomptarget --> Unregistered image 0x0000000000400ee0 from RTL 0x00000000016f2840!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x0000000000404740
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!

I am assuming that the above debug output shows that the veccopy example runs on a gfx1011 device like a gfx9* device.
Is this ok to use as a workaround in the absence of the actual gfx1011.bc (with the gfx10 features enabled)?
Please let me know your thoughts.

Also, please let me know how to register for a notification once the gfx1011.bc is built and available to use.
Thanks.

@JonChesterfield
Contributor

GFX10 is not expected to work on aomp. It's near the top of my todo list.

That trace shows it worked better than expected (except that the runtime should probably have said 'gfx10 is unsupported, sorry' and aborted). LLVM's backend is expected to work for gfx10, but the various places in openmp that assume a wavefront size of 64 will be incorrect for gfx10 (as it has a wavefront size of 32). That might work out for some simple cases as it sort of looks like a 64 wide machine with the top half inactive.
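
As a toy illustration of why a hardcoded width of 64 gives wrong answers on a 32-wide machine, consider how a thread index is split into a wavefront index and a lane index. The helpers below are hypothetical and are not taken from the OpenMP runtime; they only show the arithmetic.

```shell
# Index of the wavefront that holds a thread.
# $1 = thread index within the workgroup, $2 = wavefront size.
wave_id() { echo $(( $1 / $2 )); }

# Position of the thread inside its wavefront (same arguments).
lane_id() { echo $(( $1 % $2 )); }

# Thread 40 of a workgroup:
#   assumed width 64: wave_id 40 64 prints 0, lane_id 40 64 prints 40
#   actual width 32:  wave_id 40 32 prints 1, lane_id 40 32 prints 8
```

Code that bakes in 64 would treat thread 40 as lane 40 of the first wavefront, while on gfx10 it is lane 8 of the second one.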

The cmake I meant is the one at https://github.com/ROCm-Developer-Tools/llvm-project/blob/aomp13.0-2/openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt where a variable LIBOMPTARGET_AMDGCN_GFXLIST controls which architectures are built.

I don't know of a notification system I could use. I'll probably remember to ping this thread once it's passing our tests, but unfortunately github is routed to my spam folder so there's some lag.

@gregrodgers
Contributor

AOMP support for gfx10 is TBD. See issue 187.

@gregrodgers added the enhancement and gfx10 labels on Apr 20, 2021
@drajarshi
Author

drajarshi commented Apr 27, 2021 via email

@JonChesterfield
Contributor

Some support for gfx10 is in trunk now. It isn't heavily tested yet and has not yet reached aomp. Patch enabling it was https://reviews.llvm.org/D108708

@ppanchad-amd

@drajarshi Do you still need assistance with this ticket? If not, please close the ticket. Thanks!
