
Build "fat binary" with multiple GPU archs #139

Open

FilipVaverka opened this issue Aug 31, 2020 · 13 comments

Labels: bug (Something isn't working), enhancement (New feature or request)
@FilipVaverka

Is it possible to build a "fat binary" that is compatible with multiple GPU architectures?
I'm able to build a binary with OpenMP target offload for a single GPU architecture (such as AOMP_GPU=gfx900), but such a binary fails to run on gfx906 hardware (as expected). Is there any way to specify multiple GPU architectures to include in the binary?

@JonChesterfield
Contributor

This is unimplemented but possible. It'll involve generating code for N architectures and embedding them all in the host binary.

It would be more difficult to support running on various different GPU architectures at the same time, e.g. a machine with some gfx803 and some gfx906. Everything is also more difficult cross vendor, so nvptx64 + amdgcn in one binary would be more challenging to implement.

Is this the simpler build once, deploy to various homogeneous machines use case?

@FilipVaverka
Author

I think "build once, deploy everywhere" is the priority (at least for client applications). However, I hit the issue on my development machine, which has 2 GPUs (RX Vega, gfx900, and Radeon VII, gfx906). In this case OpenMP reports two devices available, but only one can be used, as there is no binary for the other. Here it would be better to report only those devices that can actually be used.

@gregrodgers gregrodgers self-assigned this Oct 26, 2020
@gregrodgers gregrodgers added bug Something isn't working enhancement New feature or request labels Oct 26, 2020
@gregrodgers
Contributor

There are three enhancements here, in order from least to most difficult to implement:

1. A fat binary for the same architecture but different GPUs and features.
2. Two architectures, such as amdgcn and nvptx64, with only one active for execution.
3. The combination of two concurrent devices with different architectures or different GPUs during execution.

The last would require device type management in the OpenMP standard, which is not expected until at least OpenMP 6.0. I would like to see a strong use case for the third to take to the OpenMP language committee.

@gregrodgers
Contributor

This is targeted for the next release, AOMP 13.0-3.

@gregrodgers
Contributor

AOMP 13.0-3 will only support multiple archs in the binary, not multiple archs concurrently.

@gregrodgers gregrodgers changed the title Build "fat binary" using aompcc with AOMP_GPU environment variable Build "fat binary" with multiple GPU archs Jul 6, 2021
@gregrodgers
Contributor

In aomp 13.0-4 you can build a multi-arch binary with multiple --offload-arch flags.
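To illustrate, here is a minimal sketch of assembling such a multi-arch compile line, assuming one --offload-arch=&lt;gpu&gt; flag per target GPU; the $AOMP path, the arch list, and the source file name are placeholders for your own setup. The script only prints the command, so it can be inspected without an AOMP install:

```shell
#!/bin/sh
# Sketch only: build up the flag list for a multi-arch OpenMP offload compile.
# The AOMP install path, source file, and arch list are illustrative assumptions.
CC="${AOMP:-/usr/lib/aomp}/bin/clang++"
ARCHS="gfx900 gfx906"

FLAGS="-fopenmp"
for arch in $ARCHS; do
  FLAGS="$FLAGS --offload-arch=$arch"   # one flag per target GPU
done

# Print the command rather than running it, so no AOMP install is required here.
echo "$CC $FLAGS -O3 main.cpp -o test"
```

Running the printed command (with a real AOMP toolchain) should embed one device image per listed arch in the host binary.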

@FilipVaverka
Author

How does mapping between devices and images work?
For example, I can now compile my code as

aompcc --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test

However, the resulting binary seems to be able to run only on "gfx906", as I can run it successfully with

ROCR_VISIBLE_DEVICES=1 ./test

but with 'ROCR_VISIBLE_DEVICES=0', which is "gfx900", I get

Possible gpu arch mismatch: device:gfx900, image:gfx906 please check compiler flag: -march=
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

Now, if I build only with --offload-arch gfx900, I can't run the binary regardless of ROCR_VISIBLE_DEVICES, and I get

WARNING: Runtime capabilities do NOT meet any image requirements.
So device offloading is now disabled.
Runtime capabilities : gfx906 xnack-
Image 0 requirements : gfx900

or

WARNING: Runtime capabilities do NOT meet any image requirements.
So device offloading is now disabled.
Runtime capabilities : gfx906 sramecc- xnack-
Image 0 requirements : gfx900

@gregrodgers
Contributor

gregrodgers commented Jul 12, 2021

Thank you for your patience. We just reviewed your issue in our weekly meeting. The first thing to mention, which I believe you are aware of, is that you cannot offload to two different device types in the same application instance. Yes, you should be able to build for multiple architectures and isolate the GPUs using ROCR_VISIBLE_DEVICES as you have tried.

We do not have a machine with two different cards for this type of testing, and we appreciate you helping us get this working.

Can you run these four commands and show us the output?

$AOMP/bin/offload-arch -c
env ROCR_VISIBLE_DEVICES=0 $AOMP/bin/offload-arch -c
env ROCR_VISIBLE_DEVICES=1 $AOMP/bin/offload-arch -c
$AOMP/bin/offload-arch -v

Can you switch to using clang++ (instead of aompcc)? Both clang and clang++ now support the --offload-arch flag. I believe we have a bug in the aompcc script for multiple architectures. Compile with this command:

$AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test

Thanks again for your help.

@FilipVaverka
Author

FilipVaverka commented Jul 12, 2021

Here are the tests of the offload-arch script:

Login@Machine:~> $AOMP/bin/aompcc --version
13.0-4
Login@Machine:~> $AOMP/bin/offload-arch -c
gfx906   sramecc- xnack-
Login@Machine:~> env ROCR_VISIBLE_DEVICES=0 $AOMP/bin/offload-arch -c
gfx906   xnack-
Login@Machine:~> env ROCR_VISIBLE_DEVICES=1 $AOMP/bin/offload-arch -c
gfx906   sramecc- xnack-
Login@Machine:~> $AOMP/bin/offload-arch -v
gfx906 VEGA20 1002:66AF amdgcn-amd-amdhsa
gfx900 VEGA10 1002:687F amdgcn-amd-amdhsa
Login@Machine:~> env ROCR_VISIBLE_DEVICES= $AOMP/bin/offload-arch -c # Also for any other non-existent device index
Segmentation fault (core dumped)

And here is the clang++ test you suggested. I'm probably behind on the clang version (with AOMP 13.0-4).

Login@Machine:~> $AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test
clang-13: error: unsupported option '--offload-arch'
clang-13: error: unsupported option '--offload-arch'
clang-13: error: no such file or directory: 'gfx900'
clang-13: error: no such file or directory: 'gfx906'

No problem, I'm happy to help. I wish the guys from HIP and OpenCL on ROCm were as responsive. :) (But I understand, it's a huge project and it's still quite early.)

EDIT: Just for completeness here is rocminfo_log.txt to confirm machine configuration.

@saiislam
Member

saiislam commented Jul 13, 2021

Hey @FilipVaverka,

The "=" is missing between the "--offload-arch" flag and its value "gfx906". Is it a copying mistake from pasting the commands here?

Login@Machine:~> $AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test

Otherwise, please use the following command and let us know the output:
$AOMP/bin/clang -fopenmp --offload-arch=gfx900 --offload-arch=gfx906 -O3 main.cpp -o test

AOMP 13.0-4 does support this option.

Also, please let us know the output of "offload-arch -c -v". It is supposed to print all details (including target features) of all active GPUs in the system.

@FilipVaverka
Author

Sorry, that was it. I can now compile the binary with

Login@Machine:~> $AOMP/bin/clang++ -fopenmp --offload-arch=gfx900 --offload-arch=gfx906 -O3 main.cpp -o test

However, the behavior of the "test" binary is the same: it runs with ROCR_VISIBLE_DEVICES=1 and fails with the other GPU as follows:

Login@Machine:~> ROCR_VISIBLE_DEVICES=0 ./test 
Possible gpu arch mismatch: device:gfx900, image:gfx906 please check compiler flag: -march=<gpu>
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted (core dumped)

Interestingly, "offload-arch -c -v" seems to indicate some issue with the gfx900 GPU:

Login@Machine:~> $AOMP/bin/offload-arch -c -v
gfx906 VEGA20 1002:66AF amdgcn-amd-amdhsa   sramecc- xnack-
gfx900 VEGA10 1002:687F amdgcn-amd-amdhsa   HSAERROR-INITIALIZATION

Although I don't seem to have any issues with it otherwise (the ROCm OpenCL stack works, for example).

@gregrodgers
Contributor

Thanks, this helps a lot. It appears we may have two problems. The first is the HSA error on Vega 10 with -c. The -c option is the only option in offload-arch that uses HSA, but HSA is needed by the OpenMP runtime, which is why we are failing to run the application. The second is that when you mask off the GFX906 with .._DEVICES=1, you are seeing the strange output "gfx906 sramecc- xnack-"; it should just say gfx900. We need to test on native gfx900. It appears that HSA init is getting called twice and failing.

The OpenMP runtime is calling "offload-arch -c" and getting back bad information. If it returned the correct information, the runtime would be able to choose the correct image.

I hope we can get this fixed in 13.0-5, which will be out by the end of the month (July 2021). We have another bug wherein use of the ROCm profiler fails because the runtime traps stdout from offload-arch. So we need to move offload-arch to a library call, which is a pretty big fix.

Thanks for your patience. For the time being, just compile one image per build and use the mask to select the correct one.
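The suggested workaround can be sketched as a tiny launcher: build one single-arch binary per GPU and pick the matching one at run time. Everything here (the binary names, the arch list, and how the arch string is obtained) is an illustrative assumption, not part of AOMP:

```shell
#!/bin/sh
# Workaround sketch: one binary per arch, chosen at launch time.
# Build step (requires AOMP; shown as comments only):
#   $AOMP/bin/clang++ -fopenmp --offload-arch=gfx900 -O3 main.cpp -o test.gfx900
#   $AOMP/bin/clang++ -fopenmp --offload-arch=gfx906 -O3 main.cpp -o test.gfx906

# Map a detected arch string (e.g. from "$AOMP/bin/offload-arch") to the
# matching binary; fail loudly if no image was built for that arch.
pick_binary() {
  case "$1" in
    gfx900) echo "./test.gfx900" ;;
    gfx906) echo "./test.gfx906" ;;
    *) echo "no image built for '$1'" >&2; return 1 ;;
  esac
}

# Example: run the gfx906 build on device 1, as in the commands earlier in
# this thread (printed here rather than executed, since it needs real GPUs).
bin=$(pick_binary gfx906) && echo "ROCR_VISIBLE_DEVICES=1 $bin"
```

Once the runtime can match device archs to embedded images itself, this launcher becomes unnecessary.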

@ppanchad-amd

@FilipVaverka Do you still need assistance with this ticket? If not, please close the ticket. Thanks!

Projects: None yet
Development: No branches or pull requests
5 participants