Unroll this CUDA_KERNEL_LOOP(index, n) loop to optimize its execution? #8928

YawesomeM · 2024-12-11T20:41:31Z

Hello,

While profiling yolov4_darknet with NVIDIA Nsight Compute, we found non-negligible instruction dependencies inside the CUDA_KERNEL_LOOP loop. In our evaluation, unrolling this loop (see the code below) mitigates the performance issue: unrolling gives the loop body more instructions to schedule, which increases the likelihood of hiding dependency-related GPU stalls.

-CUDA_KERNEL_LOOP(index, n) {
+#pragma unroll 4 // or another suitable unroll factor
+for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n; index += blockDim.x * gridDim.x) {
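
For anyone who wants to try the change in isolation, here is a minimal sketch of the same pattern on a toy kernel. The kernel `axpy_unrolled`, the host helper `count_mismatches`, and the 32 x 256 launch configuration are hypothetical and not taken from darknet; only the loop header and the pragma mirror the diff above. Whether unrolling actually helps, and which factor works best, depends on the kernel body and the GPU, so it is worth re-checking with Nsight Compute.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical element-wise kernel (not from darknet): y[i] = a * x[i] + y[i].
__global__ void axpy_unrolled(int n, float a, const float *x, float *y)
{
    // Grid-stride loop written out explicitly (same header as in the diff),
    // so the unroll pragma sits directly in front of the `for` statement.
    #pragma unroll 4  // assumed starting point; tune the factor per kernel
    for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < n;
         index += blockDim.x * gridDim.x) {
        y[index] = a * x[index] + y[index];
    }
}

// Hypothetical host helper: runs the kernel on n elements and returns the
// number of elements that differ from the CPU result, or -1 on a CUDA error.
int count_mismatches(int n)
{
    std::vector<float> hx(n), hy(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = float(i % 100);  // small integers keep the arithmetic exact
        hy[i] = 1.0f;
    }

    float *dx = nullptr, *dy = nullptr;
    size_t bytes = size_t(n) * sizeof(float);
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return -1; }
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    // Launch fewer threads than elements on purpose, so that each thread
    // runs several iterations of the grid-stride loop.
    axpy_unrolled<<<32, 256>>>(n, 2.0f, dx, dy);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hy[i] != 2.0f * hx[i] + 1.0f) ++bad;
    return bad;
}
```

Calling `count_mismatches(1 << 20)` from a `main` and compiling with `nvcc` gives a quick correctness check; comparing builds with and without the pragma in Nsight Compute then shows whether the stall profile changes for a given kernel.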

stephanecharette · 2024-12-16T07:50:12Z

Note this repo is no longer maintained. The new Darknet/YOLO repo is here: https://github.com/hank-ai/darknet#table-of-contents
