You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have the following minimal snippet of code with a GEMM-K style reduction that gives me wrong results. I hope to receive some help in understanding what I am doing wrong, probably with the semaphore.
Algorithm: I am just doing something very dumb:
I invoke my kernel with grid_dim = {1, 1, 10} where 10 is a number I made up.
Each CTA has 128 threads performing a WGMMA, but it's not important. The 10 CTAs in my grid just replicate the workload.
In the Epilogue snippet below, I want all the CTAs to sum their result into the global output tensor gO.
By doing so, I expect the output to be 10x the normal output, but that is not the case.
Here's the code:
template <classTiledMmaPV, classRmemTensor, classGmemTensor>
CUTE_DEVICE
voidreduce_split_kv(
RmemTensor rO, // per thread fragment. holds the correct output of a GEMM
GmemTensor cta_gO, // per CTA tile from gOint* semaphore_mem) {
auto thr_idx = static_cast<int>(threadIdx.x); // semaphore expects `int`.auto cta_idx = static_cast<int>(blockIdx.z); // semaphore expects `int`.auto semaphore = cutlass::Semaphore{semaphore_mem + blockIdx.z * 128, thr_idx};
semaphore.fetch();
auto thr_mma_pv = tiled_mma_pv.get_slice(thr_idx);
auto thr_gO = thr_mma_pv.partition_C(cta_gO);
semaphore.wait(cta_idx);
for (int i = 0; i < size(rO); ++i) {
rO[i] += thr_gO[i];
}
cute::copy(rO, thr_gO);
int lock;
if (cta_idx == blockDim.z - 1) {
lock = 0;
} else {
lock = cta_idx + 1;
}
semaphore.release(lock);
}
I tried to follow the code in gemm_with_k_reduction:
(but of course with different semantics in the semaphore).
I do understand that asking for code reading is a huge favor, but I have received a lot of wonderful favor from this forum, so I hope this is another lucky day for me.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
-
I have the following minimal snippet of code with a GEMM-K style reduction that gives me wrong results. I hope to receive some help in understanding what I am doing wrong, probably with the
semaphore
.Algorithm: I am just doing something very dumb:
grid_dim = {1, 1, 10}
where10
is a number I made up.128
threads performing a WGMMA, but it's not important. The10
CTAs in my grid just replicate the workload.gO
.By doing so, I expect the output to be
10x
the normal output, but that is not the case.Here's the code:
I tried to follow the code in
gemm_with_k_reduction
:cutlass/include/cutlass/gemm/kernel/gemm_with_k_reduction.h
Lines 577 to 590 in ffa34e7
(but of course with different semantics in the semaphore).
I do understand that asking for code reading is a huge favor, but I have received a lot of wonderful favor from this forum, so I hope this is another lucky day for me.
Thank you, in advance, for your help.
Beta Was this translation helpful? Give feedback.
All reactions