Clspv Fragments access to global memory by the Smallest access size #1329

BukeBeyond · 2024-03-21T19:53:49Z

From the following simple code:

struct S
{ 
    float n1;
    float n2;
    bool b;
};

kernel void Kernel(global struct S* s)
{
    s->n1 = s->n2 + 1;
}

Clspv will produce compact and efficient SPIR-V. There is one Load, one Add, and one Store.

         %21 = OpAccessChain %_ptr_StorageBuffer_float %13 %uint_0 %uint_1
         %23 = OpLoad %float %21
         %25 = OpFAdd %float %23 %float_1
         %26 = OpAccessChain %_ptr_StorageBuffer_float %13 %uint_0 %uint_0
               OpStore %26 %25

https://godbolt.org/z/WGxaa646x

However, with the following variation:

kernel void Kernel(global struct S* s)
{
    if (s->b) s->n1 = s->n2 + 1;
}

Clspv will explode the SPIR-V instructions: it loads the float n2 (bytes 4 to 7) as four single-byte pieces, reassembles the pieces into a float with composite inserts and a bitcast, adds 1, and then splits the result into four single-byte pieces again and stores them to n1 (bytes 0 to 3) with 4 more store instructions:

         %33 = OpAccessChain %_ptr_StorageBuffer_uchar %13 %uint_0 %uint_4
         %34 = OpLoad %uchar %33
         %36 = OpAccessChain %_ptr_StorageBuffer_uchar %13 %uint_0 %uint_5
         %37 = OpLoad %uchar %36
         %39 = OpAccessChain %_ptr_StorageBuffer_uchar %13 %uint_0 %uint_6
         %40 = OpLoad %uchar %39
         %42 = OpAccessChain %_ptr_StorageBuffer_uchar %13 %uint_0 %uint_7
         %43 = OpLoad %uchar %42
         %46 = OpCompositeInsert %v4uchar %34 %45 0
         %47 = OpCompositeInsert %v4uchar %37 %46 1
         %48 = OpCompositeInsert %v4uchar %40 %47 2
         %49 = OpCompositeInsert %v4uchar %43 %48 3
         %51 = OpBitcast %float %49
         %53 = OpFAdd %float %51 %float_1
         %54 = OpBitcast %v4uchar %53
         %55 = OpCompositeExtract %uchar %54 0
         %56 = OpCompositeExtract %uchar %54 1
         %57 = OpCompositeExtract %uchar %54 2
         %58 = OpCompositeExtract %uchar %54 3
         %59 = OpAccessChain %_ptr_StorageBuffer_uchar %13 %uint_0 %uint_0
               OpStore %59 %55
         %61 = OpAccessChain %_ptr_StorageBuffer_uchar %13 %uint_0 %uint_1
               OpStore %61 %56
         %63 = OpAccessChain %_ptr_StorageBuffer_uchar %13 %uint_0 %uint_2
               OpStore %63 %57
         %65 = OpAccessChain %_ptr_StorageBuffer_uchar %13 %uint_0 %uint_3
               OpStore %65 %58

https://godbolt.org/z/j3qoP46j9
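
To make the difference concrete, here is a rough C analogy of the two code shapes above. It is only a sketch, not what Clspv or any driver actually executes: the function names update_direct and update_fragmented are made up for illustration, the bool is modeled as a uint8_t, and memcpy stands in for OpBitcast.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the struct from the issue; the bool is modeled as one byte (assumption). */
struct S {
    float   n1;
    float   n2;
    uint8_t b;
};

/* Direct form: one 4-byte load, one add, one 4-byte store. */
void update_direct(struct S *s)
{
    if (s->b) s->n1 = s->n2 + 1.0f;
}

/* Fragmented form: every access goes through a byte pointer,
 * like the uchar access chains in the second listing. */
void update_fragmented(struct S *s)
{
    uint8_t *bytes = (uint8_t *)s;
    uint8_t piece[4];
    float value;
    size_t i;

    if (!bytes[offsetof(struct S, b)]) return;

    /* Four 1-byte loads of n2. */
    for (i = 0; i < 4; i++) piece[i] = bytes[offsetof(struct S, n2) + i];

    /* Reassemble the float (stand-in for OpCompositeInsert + OpBitcast). */
    memcpy(&value, piece, sizeof value);
    value += 1.0f;

    /* Split the result again (stand-in for OpBitcast + OpCompositeExtract). */
    memcpy(piece, &value, sizeof value);

    /* Four 1-byte stores to n1. */
    for (i = 0; i < 4; i++) bytes[offsetof(struct S, n1) + i] = piece[i];
}
```

Both functions compute the same result. The fragmented one simply needs eight single-byte memory accesses for the floats where the direct one needs two 4-byte accesses.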

Why did Clspv do that? It is because of an old restriction in Vulkan.

4 years ago, before Vulkan 1.2 and Physical Addressing were available, a buffer could only have a single pointer type. So if a kernel has to access both a byte and a float from the same buffer, Clspv has to choose the smallest one, the byte, and fragment every other access to that minimum size. A bool is stored as one byte, so any other access to that global memory (which is what most kernels use) will load and store a float in 4 pieces, an i64 in 8 pieces, and so on.

This is a worse situation than it seems at first, because recent benchmarks in #1292 have revealed that this access fragmentation can cause up to a 30% penalty in performance. We also do not have any confirmation that driver compilers reverse this kind of fragmentation, or, if they do, that they reverse it completely. The measured performance penalties suggest otherwise.

Now that we have Physical Addressing, Clspv is free to create multiple pointer types: in this case, one for accessing floats and another for accessing bools. Clspv and SPIR-V can switch pointer types and give them any physical address as needed. The loads and stores can then be done sanely, without any fragmentation.

However, Clspv has decided not to implement this modern feature so far.
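
As a loose illustration of the multiple-pointer-type idea, the C sketch below derives separately typed pointers from one base address, so each field is accessed at its natural width. This is only an analogy under stated assumptions: update_typed is a hypothetical name, the bool is again modeled as a uint8_t, and the buffer is assumed to really hold a properly aligned struct S. It does not show how Clspv would emit physical storage buffer pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as the struct in the issue; the bool is modeled as one byte (assumption). */
struct S {
    float   n1;
    float   n2;
    uint8_t b;
};

/* buffer must point to a valid, properly aligned struct S (assumption). */
void update_typed(void *buffer)
{
    /* The base pointer stands in for the physical address of the buffer. */
    unsigned char *base = buffer;

    /* One pointer type per access width, all derived from the same address. */
    const uint8_t *flag = (const uint8_t *)(base + offsetof(struct S, b));
    const float   *n2   = (const float *)(base + offsetof(struct S, n2));
    float         *n1   = (float *)(base + offsetof(struct S, n1));

    /* One 1-byte load, one 4-byte load, one 4-byte store. */
    if (*flag) *n1 = *n2 + 1.0f;
}
```

With this shape, the presence of the one-byte flag no longer forces the float accesses down to byte granularity.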
