Need help debugging and understanding the behavior of CUTLASS semaphore #1392

hyhieu · 2024-03-11T05:09:57Z


I have the following minimal snippet of code with a GEMM-K style reduction that gives me wrong results. I hope to receive some help in understanding what I am doing wrong, probably with the semaphore.

Algorithm: I am just doing something very dumb:

1. I invoke my kernel with grid_dim = {1, 1, 10}, where 10 is a number I made up.
2. Each CTA has 128 threads performing a WGMMA, but that is not important. The 10 CTAs in my grid just replicate the workload.
3. In the epilogue snippet below, I want all the CTAs to sum their results into the global output tensor gO.

By doing so, I expect the output to be 10x the normal output, but that is not the case.
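The intended turn-taking can be sketched on the host as follows. This is a plain C++ model, not CUTLASS or device code: `serial_reduce`, `lock`, `out`, and `done` are hypothetical names introduced for illustration. The sketch assumes that all participants share a single `lock` word, that `out` starts at zero, and that the last participant is the one with index `num_ctas - 1`. CTAs are polled in reverse launch order to mimic an arbitrary schedule.

```cpp
#include <vector>

// Host-side model only. Each index in [0, num_ctas) stands in for one CTA
// along grid z; `partial` stands in for the per-CTA GEMM result.
float serial_reduce(int num_ctas, float partial) {
    int lock = 0;       // shared semaphore word; 0 means "CTA 0 goes first"
    float out = 0.0f;   // stands in for the global tile, assumed zero-initialised
    std::vector<bool> done(num_ctas, false);
    int remaining = num_ctas;

    while (remaining > 0) {
        // Poll in reverse order: the lock, not the schedule, decides who runs.
        for (int cta_idx = num_ctas - 1; cta_idx >= 0; --cta_idx) {
            // A real wait(cta_idx) would keep spinning in this case.
            if (done[cta_idx] || lock != cta_idx) continue;

            out += partial;  // read-modify-write of the shared output

            // Pass the turn on, or reset to 0 after the last participant.
            lock = (cta_idx == num_ctas - 1) ? 0 : cta_idx + 1;
            done[cta_idx] = true;
            --remaining;
        }
    }
    return out;
}
```

Under these assumptions, `serial_reduce(10, x)` returns ten times `x`, which is the behavior expected from the kernel.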

Here's the code:

template <class TiledMmaPV, class RmemTensor, class GmemTensor>
CUTE_DEVICE
void reduce_split_kv(
    RmemTensor rO,      // per thread fragment. holds the correct output of a GEMM
    GmemTensor cta_gO,  // per CTA tile from gO
    int* semaphore_mem) {
    auto thr_idx = static_cast<int>(threadIdx.x);  // semaphore expects `int`.
    auto cta_idx = static_cast<int>(blockIdx.z);   // semaphore expects `int`.

    auto semaphore = cutlass::Semaphore{semaphore_mem + blockIdx.z * 128, thr_idx};
    semaphore.fetch();

    auto thr_mma_pv = tiled_mma_pv.get_slice(thr_idx);
    auto thr_gO = thr_mma_pv.partition_C(cta_gO);

    semaphore.wait(cta_idx);

    for (int i = 0; i < size(rO); ++i) {
        rO[i] += thr_gO[i];
    }

    cute::copy(rO, thr_gO);

    int lock;
    if (cta_idx == blockDim.z - 1) {
        lock = 0;
    } else {
        lock = cta_idx + 1;
    }
    semaphore.release(lock);
}

I tried to follow the code in gemm_with_k_reduction:

cutlass/include/cutlass/gemm/kernel/gemm_with_k_reduction.h

Lines 577 to 590 in ffa34e7

    
    Semaphore semaphore(params.semaphore + block_idx, thread_idx);

    if (params.mode == GemmUniversalMode::kGemm) {

      // If performing a reduction via split-K, fetch the initial synchronization
      if (params.grid_tiled_shape.k() > 1) {

        // Fetch the synchronization lock initially but do not block.
        semaphore.fetch();

        // Indicate which position in a serial reduction the output operator is currently updating
        output_op.set_k_partition(threadblock_tile_offset.k(), params.grid_tiled_shape.k());
      }
    }

(but of course with different semantics in the semaphore).

I do understand that asking others to read my code is a huge favor, but I have received many wonderful favors from this forum, so I hope this is another lucky day for me.

Thank you, in advance, for your help.
