Something I noticed while working on optimisations for Volta.
Most of the time, our explicitly vectorised loads and stores for 8 Float16 elements are emitted as e.g. ld.shared.v4.b32, as expected. Occasionally, however, they appear to be split into two ld.shared.v4.u16s.
The issue seems to be that sometimes an <8 x i16> load/store is generated rather than an <8 x half> load/store in LLVM IR, and NVPTX refuses to fully vectorise the former.
mwe.ll
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

; <8 x i16> store: not fully vectorised by NVPTX
define void @broken(i16 addrspace(1)* %ptr) {
  store <8 x i16> zeroinitializer, i16 addrspace(1)* %ptr
  ret void
}

; <8 x half> store: vectorised as expected
define void @working(half addrspace(1)* %ptr) {
  store <8 x half> zeroinitializer, half addrspace(1)* %ptr
  ret void
}
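(The emitted PTX for both functions can be inspected with e.g. llc -mcpu=sm_70 -O3 -o - mwe.ll; those exact flags are just an illustrative choice for a Volta target, not necessarily what was used for the output below.)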
The llc mwe.ll output (omitted here) shows the difference: the store in @broken is split, while @working gets a single vectorised store.
We could fix this by using inline assembly (a rough sketch of that approach is appended at the end of this post), but then we cannot make use of the reg+offset addressing mode of PTX, thereby forcing NVPTX to materialise all pointer operands in registers. I hence did some digging in the backend:
https://github.com/llvm/llvm-project/blob/7cbf1a2591520c2491aa35339f227775f4d3adf6/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp#L5110-L5122
i.e. i16 is only vectorised up to 4 elements. Luckily, this seems to already be patched in main by llvm/llvm-project#65799, so I'll check if applying that patch fixes it, and if so, backport it to Julia's LLVM version.
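For illustration, here is a minimal sketch of what the inline-assembly route mentioned above could look like, assuming the eight half values have already been packed into four 32-bit registers; the @asm_store function, its signature, and the choice of st.global.v4.b32 are my own assumptions for this example, not code we actually use:

; Illustrative only: store 8 half elements (pre-packed into four i32 values)
; through inline PTX. The address is handed to the asm as an opaque register
; operand ($0), so the backend cannot fold constant offsets into a
; [reg+offset] operand the way it can for regular st instructions.
define void @asm_store(half addrspace(1)* %ptr, i32 %a, i32 %b, i32 %c, i32 %d) {
  %addr = ptrtoint half addrspace(1)* %ptr to i64
  call void asm sideeffect "st.global.v4.b32 [$0], {$1, $2, $3, $4};",
      "l,r,r,r,r"(i64 %addr, i32 %a, i32 %b, i32 %c, i32 %d)
  ret void
}

This would give us the wide store, but every address has to be fully materialised in a 64-bit register before the call, which is exactly the drawback with this approach described above.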