```cpp
#include "common.h"

__global__ void knl(int* out, int filter) {
  int x[1024];
  x[filter] = 0;
  if (threadIdx.x < filter) out[threadIdx.x] = x[threadIdx.x];
}

int main() {
  knl<<<1, 1>>>(nullptr, 0);
  hipCheck(hipDeviceSynchronize());
}
```
Describe your question
Hi, I am reading https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/tutorial/profiling-by-example.html#spill-scratch-buffer, and am wondering why `x` in the kernel above would spill into global memory (the documentation reads: "that cannot reasonably fit into registers"):

> the stack is backed by global memory

Using `hipGetDeviceProperties` on MI250, we see that `regsPerBlock` is 65536 registers (32 bits each). Since 1024 < 65536 and we launch a single thread block with a single thread, why are we spilling? The rocprofiler-compute documentation also suggests that the VGPR file is in the tens or hundreds of KB, so I am surprised.

Is it because the warp size on Instinct is 64, so a single thread cannot really be scheduled on its own and 64 threads are scheduled behind the scenes, requiring 64 × 1024 = 65536 32-bit registers? I suspect this is not the case, since the other 63 threads would presumably be handled by branching and just sit idle, no?
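
The arithmetic behind the question can be sketched in plain host C++ (no HIP needed). The first three constants come straight from the question. The per-lane limit `kAssumedMaxVgprsPerLane` is an assumption added for illustration (512 VGPRs addressable by one lane on gfx90a) and has not been checked against the hardware documentation here. The helper names are hypothetical as well.

```cpp
#include <cassert>
#include <cstdio>

// Values taken from the question.
constexpr int kRegsPerBlock  = 65536;  // regsPerBlock from hipGetDeviceProperties on MI250
constexpr int kArrayElems    = 1024;   // int x[1024]: one 32-bit register per element
constexpr int kWavefrontSize = 64;     // warp (wavefront) size on Instinct

// ASSUMPTION (not from the question): one lane can address at most 512 VGPRs.
constexpr int kAssumedMaxVgprsPerLane = 512;

// Registers needed if every lane of a wavefront keeps a private copy of the array.
constexpr int regs_for_wavefront(int elems_per_thread, int wavefront_size) {
  return elems_per_thread * wavefront_size;
}

// True if a private array of this size fits within one lane's register budget.
constexpr bool fits_per_lane(int elems_per_thread, int max_vgprs_per_lane) {
  return elems_per_thread <= max_vgprs_per_lane;
}

// Prints both views of the budget: per wavefront and per lane.
void report() {
  assert(kWavefrontSize > 0);
  std::printf("wavefront total: %d registers (regsPerBlock = %d)\n",
              regs_for_wavefront(kArrayElems, kWavefrontSize), kRegsPerBlock);
  std::printf("fits in one lane under the assumed limit of %d: %s\n",
              kAssumedMaxVgprsPerLane,
              fits_per_lane(kArrayElems, kAssumedMaxVgprsPerLane) ? "yes" : "no");
}
```

Calling `report()` from a `main` shows the two framings side by side: the per-block total that `regsPerBlock` describes, and the per-lane budget that would apply if the assumed limit holds.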
Thank you!
Additional context
No response