Store address: (blx.x, blx.y)
(M, N): 16384, 16384
Copy with TMA load and store -- no swizzling.
smem size: 32896.
Trial 0 Completed in 0.554967ms (1934.78 GB/s)
Trial 1 Completed in 0.182786ms (5874.31 GB/s)
Trial 2 Completed in 0.179789ms (5972.23 GB/s)
Trial 3 Completed in 0.180768ms (5939.89 GB/s)
Trial 4 Completed in 0.181476ms (5916.72 GB/s)
Trial 5 Completed in 0.181638ms (5911.44 GB/s)
Trial 6 Completed in 0.180911ms (5935.19 GB/s)
Trial 7 Completed in 0.18125ms (5924.09 GB/s)
Trial 8 Completed in 0.179573ms (5979.42 GB/s)
Trial 9 Completed in 0.180553ms (5946.96 GB/s)
Success 2097152, Fail 0
Store address: (blx.x, 0)
(M, N): 16384, 16384
Copy with TMA load and store -- no swizzling.
smem size: 32896.
Trial 0 Completed in 0.6632ms (1619.03 GB/s)
Trial 1 Completed in 0.293118ms (3663.17 GB/s)
Trial 2 Completed in 0.291583ms (3682.46 GB/s)
Trial 3 Completed in 0.292431ms (3671.78 GB/s)
Trial 4 Completed in 0.292064ms (3676.39 GB/s)
Trial 5 Completed in 0.292127ms (3675.6 GB/s)
Trial 6 Completed in 0.29137ms (3685.15 GB/s)
Trial 7 Completed in 0.292178ms (3674.96 GB/s)
Trial 8 Completed in 0.29203ms (3676.82 GB/s)
Trial 9 Completed in 0.292341ms (3672.91 GB/s)
Success 2097152, Fail 0
With a conventional store, each block writes its final result to global memory at the address for (blx.x, blx.y). For a reduction, the result should instead go to (blx.x, 0), because every block in a row is reduced into a single block.
Surprisingly, writing to (blx.x, blx.y) is much faster (5946.96 GB/s vs. 3672.91 GB/s, taking the last trial of each run above, with M = N = 16384), and the results still pass the correctness check across repeated measurements.
However, I'm concerned that writing to (blx.x, blx.y) may store to the wrong locations, despite the performance improvement.
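To make the concern concrete, here is a minimal host-side C++ model of the two addressing choices. It is not the kernel from this issue: it assumes the reduction is a sum, that each block contributes a single partial value, and that partials are laid out row-major by (blx.x, blx.y). The function names `store_per_block` and `reduce_into_column0` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: partial[bx * blocks_y + by] is the value produced by
// block (bx, by). This only illustrates addressing, not TMA behavior.

// Variant A: plain store to (bx, by). Every block owns a distinct slot,
// so no values are combined and the output keeps blocks_x * blocks_y entries.
std::vector<float> store_per_block(const std::vector<float>& partial,
                                   std::size_t blocks_x, std::size_t blocks_y) {
    assert(partial.size() == blocks_x * blocks_y);
    std::vector<float> out(blocks_x * blocks_y, 0.0f);
    for (std::size_t bx = 0; bx < blocks_x; ++bx) {
        for (std::size_t by = 0; by < blocks_y; ++by) {
            out[bx * blocks_y + by] = partial[bx * blocks_y + by];
        }
    }
    return out;
}

// Variant B: reduce into (bx, 0). All blocks in a row accumulate into the
// same slot, so the output has one entry per row (assumed sum reduction).
std::vector<float> reduce_into_column0(const std::vector<float>& partial,
                                       std::size_t blocks_x, std::size_t blocks_y) {
    assert(partial.size() == blocks_x * blocks_y);
    std::vector<float> out(blocks_x, 0.0f);
    for (std::size_t bx = 0; bx < blocks_x; ++bx) {
        for (std::size_t by = 0; by < blocks_y; ++by) {
            out[bx] += partial[bx * blocks_y + by];
        }
    }
    return out;
}
```

For a 2 x 3 grid with partials 1 through 6, `reduce_into_column0` returns {6, 15}, while `store_per_block` returns the six partials unchanged. Under these assumptions, variant A never combines anything, so a check that passes for both variants may not be comparing against the reduced row. In variant B every block in a row targets the same address, which on a GPU could plausibly serialize the updates and account for the lower bandwidth, though that is not confirmed by the logs above.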