
Incomplete vectorisation of FP16 loads and stores #184

Open
thomasfaingnaert opened this issue Jan 15, 2024 · 0 comments
Something I noticed while working on optimisations for Volta.

Most of the time, our explicitly vectorised loads and stores of 8 Float16 elements are emitted as a single instruction, e.g. ld.shared.v4.b32, as expected. Occasionally, however, they are split into two ld.shared.v4.u16 instructions.

The issue seems to be that sometimes an <8 x i16> load/store is generated in LLVM IR rather than an <8 x half> one, and the NVPTX backend refuses to fully vectorise the former.

mwe.ll:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

define void @broken(i16 addrspace(1)* %ptr) {
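  ; integer element type: NVPTX splits this into two st.global.v4.u16 (see PTX below)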
  store <8 x i16> zeroinitializer, i16 addrspace(1)* %ptr
  ret void
}

define void @working(half addrspace(1)* %ptr) {
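  ; floating-point element type: NVPTX emits a single st.global.v4.b32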
  store <8 x half> zeroinitializer, half addrspace(1)* %ptr
  ret void
}

llc mwe.ll:

//
// Generated by LLVM NVPTX Back-End
//

.version 6.0
.target sm_30
.address_size 64

	// .globl	broken                  // -- Begin function broken
                                        // @broken
.visible .func broken(
	.param .b64 broken_param_0
)
{
	.reg .b16 	%rs<2>;
	.reg .b64 	%rd<2>;

// %bb.0:
	ld.param.u64 	%rd1, [broken_param_0];
	mov.u16 	%rs1, 0;
	st.global.v4.u16 	[%rd1+8], {%rs1, %rs1, %rs1, %rs1};
	st.global.v4.u16 	[%rd1], {%rs1, %rs1, %rs1, %rs1};
	ret;
                                        // -- End function
}
	// .globl	working                 // -- Begin function working
.visible .func working(
	.param .b64 working_param_0
)                                       // @working
{
	.reg .b32 	%r<6>;
	.reg .b32 	%hh<2>;
	.reg .b64 	%rd<2>;

// %bb.0:
	ld.param.u64 	%rd1, [working_param_0];
	mov.u32 	%r5, 0;
	mov.b32 	%hh1, %r5;
	st.global.v4.b32 	[%rd1], {%r5, %r5, %r5, %r5};
	ret;
                                        // -- End function
}

We could fix this by using inline assembly, but then we cannot make use of PTX's reg+offset addressing mode, which forces NVPTX to materialise all pointer operands in registers. I therefore did some digging in the backend instead:
https://github.com/llvm/llvm-project/blob/7cbf1a2591520c2491aa35339f227775f4d3adf6/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp#L5110-L5122

i.e. i16 vectors are only vectorised up to 4 elements. Luckily, this already appears to be fixed on main by llvm/llvm-project#65799, so I'll check whether applying that patch resolves the issue, and if so, backport it to Julia's LLVM version.
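In the meantime, a possible IR-level workaround (just a sketch, assuming the frontend can control the element type; the @workaround name is mine) is to bitcast the pointer and store an <8 x half> instead, so NVPTX sees the floating-point element type it does fully vectorise:

define void @workaround(i16 addrspace(1)* %ptr) {
  ; reinterpret the i16 pointer as a half pointer, then store a half vector,
  ; which NVPTX lowers to a single st.global.v4.b32 (as in @working above)
  %hptr = bitcast i16 addrspace(1)* %ptr to half addrspace(1)*
  store <8 x half> zeroinitializer, half addrspace(1)* %hptr
  ret void
}

With this, llc should emit the same single st.global.v4.b32 as in the @working case above.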

thomasfaingnaert self-assigned this on Jan 15, 2024