compiler inserts arrive and wait for each GMMA instruction #1645

DeMoriarty · 2024-07-19T19:00:24Z

In a GEMM kernel that I'm modifying, I noticed that each HGMMA instruction is being waited upon immediately:

      ...
      WARPGROUP.ARRIVE 
      IADD3 R151, R152, 0x402, RZ 
      HGMMA.64x128x16.F16 R56, gdesc[UR8], R56, gsb0 
      WARPGROUP.DEPBAR.LE gsb0, 0x0
      ...
      WARPGROUP.ARRIVE 
      IADD3 R151, R152, 0x404, RZ 
      IADD3 R152, R152, 0x406, RZ 
      HGMMA.64x128x16.F16 R56, gdesc[UR8], R56, gsb0 
      WARPGROUP.DEPBAR.LE gsb0, 0x0
      ...

But in the CUDA source code, the HGMMA instructions are committed in batches:

      warpgroup_fence_operand(accum);
      warpgroup_arrive();
      CUTLASS_PRAGMA_UNROLL
      for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
        cute::gemm(tiled_mma, tCrA(_,_,k_block,read_stage), tCrB(_,_,k_block,read_stage), accum);
        tiled_mma.accumulate_ = GMMA::ScaleOut::One;
      }
      warpgroup_commit_batch();
      warpgroup_wait<0>();

What might be causing the compiler to insert these DEPBAR.LE & ARRIVE?
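
One way to quantify what the SASS excerpt shows is to count how many HGMMA instructions are issued between consecutive waits. The sketch below is a hypothetical helper (the names `GmmaWaitStats` and `analyze_sass` are illustrative and not part of CUTLASS or the CUDA toolkit). It assumes a plain-text SASS listing with one instruction per line and the mnemonics seen in the excerpt above. A largest batch of 1 corresponds to the fully serialized pattern; a value equal to the number of k-blocks would match the batching written in the source.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Summary of how HGMMA instructions are grouped between waits in a SASS listing.
struct GmmaWaitStats {
    int hgmma_count = 0;  // total HGMMA instructions seen
    int wait_count = 0;   // total WARPGROUP.DEPBAR instructions seen
    int max_batch = 0;    // largest number of HGMMAs issued between two waits
};

// Hypothetical helper: scans a SASS text dump line by line.
// Assumption: each instruction sits on its own line, as in the excerpt above.
GmmaWaitStats analyze_sass(std::istream& in) {
    GmmaWaitStats stats;
    int current_batch = 0;  // HGMMAs seen since the previous wait
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("HGMMA.") != std::string::npos) {
            ++stats.hgmma_count;
            ++current_batch;
        } else if (line.find("WARPGROUP.DEPBAR") != std::string::npos) {
            ++stats.wait_count;
            if (current_batch > stats.max_batch) stats.max_batch = current_batch;
            current_batch = 0;
        }
    }
    // Account for HGMMAs that trail the last wait in a truncated listing.
    if (current_batch > stats.max_batch) stats.max_batch = current_batch;
    return stats;
}
```

The function accepts any `std::istream`, so it can read a saved SASS dump (for example the output of `cuobjdump -sass`) from a file stream or from `std::cin`.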

thakkarV · 2024-07-19T19:33:07Z

Collaborator

When you compile this kernel, you should be getting some warnings from ptxas about serialization of the WGMMA instructions. What do they say?

12 replies

thakkarV Jul 24, 2024
Collaborator

Could you please try CUDA 12.3? There are some compiler bugs around overzealous MMA synchronization in newer toolkits.

DeMoriarty Jul 24, 2024
Author

Tried with 12.3.0, still the same :/

thakkarV Jul 24, 2024
Collaborator

wgmma.mma_async instructions are serialized due to non wgmma instructions defining accumulator registers of a wgmma between start and end of the pipeline stage in the function

This warning suggests that something in your C++ code is touching the A and/or accumulator registers of the MMA inside the pipeline stage, that is, between the operand fence before the MMA and the wait group and operand fence after the MMA. (This would be a bug in your source code.)

If not in C++, it is likely to be in the generated PTX. (This would be an NVVM bug.)

If not in PTX, it is likely to be introduced during compilation to SASS. (This would be a ptxas bug.)
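
The PTX step of this triage can be roughly automated. The following is a minimal sketch, not an NVIDIA tool: `find_accumulator_writes` and its helpers are hypothetical names, and the check is a heuristic. It assumes each PTX instruction sits on one line, that a pipeline stage runs from `wgmma.fence` to the next `wgmma.wait_group`, and that the first `{...}` list on a `wgmma.mma_async` line holds the accumulator registers. It reports non-wgmma instructions in that window whose first operand is one of those registers.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    std::size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// Collect the registers in the first "{...}" list of a wgmma.mma_async line.
// Assumption: that list is the accumulator operand.
static void collect_accumulators(const std::string& line, std::set<std::string>& acc) {
    std::size_t open = line.find('{'), close = line.find('}');
    if (open == std::string::npos || close == std::string::npos || close < open) return;
    std::stringstream list(line.substr(open + 1, close - open - 1));
    std::string reg;
    while (std::getline(list, reg, ',')) {
        reg = trim(reg);
        if (!reg.empty()) acc.insert(reg);
    }
}

// First operand of an ordinary instruction (treated as its destination), or "".
static std::string destination(const std::string& instr) {
    std::stringstream ss(instr);
    std::string tok;
    ss >> tok;                                     // opcode, or an "@%p" guard
    if (!tok.empty() && tok[0] == '@') ss >> tok;  // skip the guard, read the opcode
    if (!(ss >> tok)) return "";                   // no operand (label, brace, ...)
    while (!tok.empty() && (tok.back() == ',' || tok.back() == ';')) tok.pop_back();
    return tok;
}

// Returns non-wgmma instructions that write an accumulator register inside a
// wgmma.fence ... wgmma.wait_group window (heuristic, see assumptions above).
std::vector<std::string> find_accumulator_writes(std::istream& ptx) {
    std::vector<std::string> hits, window;
    std::set<std::string> acc;
    bool in_window = false;
    std::string raw;
    while (std::getline(ptx, raw)) {
        std::string line = trim(raw);
        if (line.empty() || line.rfind("//", 0) == 0) continue;  // skip blanks, comments
        if (line.rfind("wgmma.fence", 0) == 0) {
            if (!in_window) { in_window = true; window.clear(); acc.clear(); }
        } else if (!in_window) {
            continue;
        } else if (line.rfind("wgmma.mma_async", 0) == 0) {
            collect_accumulators(line, acc);
        } else if (line.rfind("wgmma.wait_group", 0) == 0) {
            // Window closed: check every buffered instruction against the accumulators.
            for (const std::string& w : window)
                if (acc.count(destination(w))) hits.push_back(w);
            in_window = false;
        } else if (line.rfind("wgmma.", 0) != 0) {
            window.push_back(line);  // ordinary instruction inside the stage
        }
    }
    return hits;
}
```

An empty result on the PTX (for example from `nvcc -ptx`) does not prove the PTX is clean, since the scan ignores control flow and stages that leave MMAs in flight across a wait, but any hit is a concrete line to include in a bug report.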

Without more info, like the C++ or PTX source, I am not sure I can help more, but I encourage you to file CUDA bugs against the appropriate modules.

DeMoriarty Jul 24, 2024
Author

Thanks a lot for your help! Do you think any of these compiler flags could be causing this?

--expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -forward-unknown-to-host-compiler -std=c++17 -O3 -Wunused -Xcompiler=-Wconversion -Xcompiler=-fPIC -Xcompiler=-fno-strict-aliasing --expt-relaxed-constexpr -DNDEBUG --use_fast_math --expt-extended-lambda -lineinfo -res-usage -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -gencode arch=compute_90a,code=sm_90a -DCUTLASS_DEBUG_TRACE_LEVEL=0 '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -D_GLIBCXX_USE_CXX11_ABI=0

thakkarV Jul 24, 2024
Collaborator

Try removing -fPIC
