-
Notifications
You must be signed in to change notification settings - Fork 12.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU] Identical LLVM IR file with different basic block ordering cause miscompilation #109391
Comments
@llvm/issue-subscribers-backend-amdgpu Author: Valentin Churavy (vchuravy)
Reduced from https://github.com/JuliaGPU/AMDGPU.jl/issues/672#issuecomment-2347151487
The code is a double-nested loops and the bug manifests as if we were skipping a loop. I have two small LLVM modules that are identical, except that in one I manually reorder the BBs to follow the order of the "working" version. https://godbolt.org/z/sscdTK7d7 https://gist.github.com/vchuravy/0c60cf4b9c497f6c8050f2a1137cd399
Lastly, I encountered this on LLVM 15, and it also reproduces on LLVM 16. LLVM 17 either hides or has this bug fixed. |
@llvm/issue-subscribers-julialang Author: Valentin Churavy (vchuravy)
Reduced from https://github.com/JuliaGPU/AMDGPU.jl/issues/672#issuecomment-2347151487
The code is a double-nested loops and the bug manifests as if we were skipping a loop. I have two small LLVM modules that are identical, except that in one I manually reorder the BBs to follow the order of the "working" version. https://godbolt.org/z/sscdTK7d7 https://gist.github.com/vchuravy/0c60cf4b9c497f6c8050f2a1137cd399
Lastly, I encountered this on LLVM 15, and it also reproduces on LLVM 16. LLVM 17 either hides or has this bug fixed. |
Does this reproduce with main? I see no difference in 17.0 or later: https://godbolt.org/z/Es9jdoYsd I'm guessing this is a gfx11 specific issue. Newer versions don't have the s_sendmsg sendmsg(MSG_DEALLOC_VGPRS) at the end, so this might just be one of the early deallocation bugs that was fixed. cc @jayfoad |
The original reporter had a gfx1102 and I reproduced it on a gfx1103. As far as I can tell this doesn't reproduce on main. I hadn't had the time to bisect. This is on an LTS release for us, and I would like to Backports the fix if possible. |
The reason for reimplementing it was as a basis for implementing this fairly important bug fix: 4b6d41c But from reading the description above ("the bug manifests as if we were skipping a loop") it's not clear to me how this is related to dealloc vgprs. |
I just noticed in the newer versions, there is no more dealloc. The only difference between the pass/fail in the 16 diff was a few register assignments and one s_delay_alu |
Reduced from JuliaGPU/AMDGPU.jl#672 (comment)
The code is a double-nested loops and the bug manifests as if we were skipping a loop.
In one version of the code after optimization (specifically DCE) the basic blocks end up in a different order.
I have two small LLVM modules that are identical, except that in one I manually reorder the BBs to follow the order of the "working" version.
https://godbolt.org/z/sscdTK7d7
https://gist.github.com/vchuravy/0c60cf4b9c497f6c8050f2a1137cd399
broken.ll
is the original file that is exhibiting the miscompiation andbroken.reorder.ll
is the file that emits the same code as the working MWE.Lastly, I encountered this on LLVM 15, and it also reproduces on LLVM 16. LLVM 17 either hides or has this bug fixed.
The text was updated successfully, but these errors were encountered: