Performance Issue in Writing Reduced Results to Global Memory #6

ziyuhuang123 · 2024-09-19T14:35:41Z

Output when writing to (blx.x, blx.y):
(M, N): 16384, 16384
Copy with TMA load and store -- no swizzling.
smem size: 32896.
Trial 0 Completed in 0.554967ms (1934.78 GB/s)
Trial 1 Completed in 0.182786ms (5874.31 GB/s)
Trial 2 Completed in 0.179789ms (5972.23 GB/s)
Trial 3 Completed in 0.180768ms (5939.89 GB/s)
Trial 4 Completed in 0.181476ms (5916.72 GB/s)
Trial 5 Completed in 0.181638ms (5911.44 GB/s)
Trial 6 Completed in 0.180911ms (5935.19 GB/s)
Trial 7 Completed in 0.18125ms (5924.09 GB/s)
Trial 8 Completed in 0.179573ms (5979.42 GB/s)
Trial 9 Completed in 0.180553ms (5946.96 GB/s)
Success 2097152, Fail 0


Output when writing to (blx.x, 0):
(M, N): 16384, 16384
Copy with TMA load and store -- no swizzling.
smem size: 32896.
Trial 0 Completed in 0.6632ms (1619.03 GB/s)
Trial 1 Completed in 0.293118ms (3663.17 GB/s)
Trial 2 Completed in 0.291583ms (3682.46 GB/s)
Trial 3 Completed in 0.292431ms (3671.78 GB/s)
Trial 4 Completed in 0.292064ms (3676.39 GB/s)
Trial 5 Completed in 0.292127ms (3675.6 GB/s)
Trial 6 Completed in 0.29137ms (3685.15 GB/s)
Trial 7 Completed in 0.292178ms (3674.96 GB/s)
Trial 8 Completed in 0.29203ms (3676.82 GB/s)
Trial 9 Completed in 0.292341ms (3672.91 GB/s)
Success 2097152, Fail 0

When writing the final results to global memory with a conventional STORE, each block would write to the address corresponding to (blx.x, blx.y). However, since we are performing a reduction, the results should instead be written to the address (blx.x, 0), because every block in a row is reduced into a single output block.
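To make the two addressing schemes concrete, here is a minimal host-side sketch in plain C++. It is not the kernel from this issue: `Grid`, `storePerBlock`, and `reduceToColumnZero` are hypothetical names, each block is modelled as holding a single partial value, and the reduction is assumed to be a sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: a grid of gridX x gridY blocks, where block (bx, by)
// holds one partial value stored at partial[bx * gridY + by].
struct Grid {
    std::size_t gridX;
    std::size_t gridY;
};

// Scheme 1: block (bx, by) stores to its own slot (bx, by).
// No two blocks share an address, so nothing is combined across by.
std::vector<float> storePerBlock(const Grid& g, const std::vector<float>& partial) {
    assert(partial.size() == g.gridX * g.gridY);
    std::vector<float> out(g.gridX * g.gridY, 0.0f);
    for (std::size_t bx = 0; bx < g.gridX; ++bx) {
        for (std::size_t by = 0; by < g.gridY; ++by) {
            out[bx * g.gridY + by] = partial[bx * g.gridY + by];
        }
    }
    return out;
}

// Scheme 2: every block in row bx targets the single slot (bx, 0).
// The writes are accumulated here; on a GPU the blocks run concurrently,
// so this accumulation would need an atomic or reducing store.
std::vector<float> reduceToColumnZero(const Grid& g, const std::vector<float>& partial) {
    assert(partial.size() == g.gridX * g.gridY);
    std::vector<float> out(g.gridX, 0.0f);
    for (std::size_t bx = 0; bx < g.gridX; ++bx) {
        for (std::size_t by = 0; by < g.gridY; ++by) {
            out[bx] += partial[bx * g.gridY + by];
        }
    }
    return out;
}
```

In this model the first scheme produces gridX * gridY independent outputs, while the second funnels gridY writes into each output slot. That contention on a shared address is one plausible reason the second variant could be slower, though the sketch itself does not measure anything.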

Surprisingly, writing to the (blx.x, blx.y) address is about 1.6x faster (5946.96 GB/s vs. 3672.91 GB/s in Trial 9, i.e. roughly 0.18 ms vs. 0.29 ms per trial), and the results are still correct across multiple measurements (with dimensions M = N = 16384).
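As a sanity check on the numbers, the reported GB/s figures in both logs are consistent with a fixed 2^30 bytes being counted per trial (4 bytes per element for a 16384 x 16384 matrix), divided by the measured time. Which transfers those bytes correspond to is not stated in the logs, so the byte count below is an inference from the figures, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `ms` milliseconds.
double bandwidthGBs(std::uint64_t bytes, double ms) {
    assert(ms > 0.0);
    return static_cast<double>(bytes) / (ms * 1e-3) / 1e9;
}

// How many times faster run A is than run B, from their per-trial times.
double speedup(double msA, double msB) {
    assert(msA > 0.0);
    return msB / msA;
}
```

For example, `bandwidthGBs(1ull << 30, 0.180553)` and `bandwidthGBs(1ull << 30, 0.292341)` land on the Trial 9 figures of the two runs to within rounding, and `speedup(0.180553, 0.292341)` is about 1.62. If the byte count really is the same constant in both runs, the GB/s gap is just the time gap expressed differently, and it does not show how many bytes each variant actually writes.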

However, I'm concerned that writing to (blx.x, blx.y) might put the results at the wrong addresses, despite the performance improvement.
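One way to settle that concern would be to compare the reduced output against a host-side reference instead of relying on the Success/Fail count alone, since the log does not say what that count compares. A minimal sketch, assuming a row-major input, a sum reduction along each row, and one reduced value per row; `countRowMismatches` is a hypothetical helper, not part of the benchmark.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Counts the rows whose reduced value differs from a host-side row sum by
// more than `tol`. `input` is rows x cols, row-major; `reduced` has one
// entry per row (the value expected at column 0 after the reduction).
std::size_t countRowMismatches(const std::vector<float>& input,
                               std::size_t rows, std::size_t cols,
                               const std::vector<float>& reduced,
                               double tol) {
    assert(input.size() == rows * cols);
    assert(reduced.size() == rows);
    std::size_t mismatches = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        double ref = 0.0;  // accumulate in double to limit rounding error
        for (std::size_t c = 0; c < cols; ++c) {
            ref += input[r * cols + c];
        }
        if (std::fabs(ref - static_cast<double>(reduced[r])) > tol) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

If the (blx.x, blx.y) variant only leaves per-block partial results in memory, a check like this would flag every row, whereas a check that compares each block's tile in isolation could still pass.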
