While profiling yolov4_darknet with NVIDIA Nsight Compute, we found non-negligible instruction-dependency stalls in the loop generated by CUDA_KERNEL_LOOP. In our evaluation, unrolling this loop (see the code below) mitigates the performance issue: unrolling gives the loop body more instructions to schedule, which increases the likelihood of hiding dependency-related GPU stalls.
-CUDA_KERNEL_LOOP(index, n) {
+#pragma unroll 4 // or another suitable unroll factor
+for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n; index += blockDim.x * gridDim.x) {
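For readers who want to try the change in isolation, here is a minimal sketch of the same pattern in a standalone kernel. The kernel `scale_add_kernel`, the helper `run_scale_add`, and the launch configuration are hypothetical and not taken from the project. The sketch also assumes CUDA_KERNEL_LOOP expands to the grid-stride `for` loop shown in the diff. The unroll factor of 4 is only a starting point and should be tuned with the profiler.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical element-wise kernel (y = alpha * x + y); the name and signature
// are illustrative only and do not come from the project.
__global__ void scale_add_kernel(int n, float alpha, const float* x, float* y) {
    // The grid-stride loop is written out by hand so the pragma sits directly
    // in front of the loop. The factor 4 is a tuning parameter (assumption).
    #pragma unroll 4
    for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n;
         index += blockDim.x * gridDim.x) {
        y[index] = alpha * x[index] + y[index];
    }
}

// Hypothetical host helper: runs the kernel and compares against a CPU reference.
// Returns true when every element matches.
bool run_scale_add(int n) {
    std::vector<float> hx(n), hy(n), ref(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = 0.5f * i;                 // exactly representable for small i
        hy[i] = 1.0f;
        ref[i] = 2.0f * hx[i] + hy[i];    // CPU reference for alpha = 2
    }

    float* dx = nullptr;
    float* dy = nullptr;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return false; }
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    // Deliberately small grid so each thread runs several loop iterations,
    // which is the case where unrolling can matter.
    scale_add_kernel<<<2, 64>>>(n, 2.0f, dx, dy);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    for (int i = 0; ok && i < n; ++i) ok = (hy[i] == ref[i]);
    return ok;
}
```

Calling `run_scale_add(1000)` from a host `main` should return true on a machine with a working CUDA device. To judge whether unrolling helps a real kernel, compare the stall breakdown in Nsight Compute before and after the change rather than relying on this toy example.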