-
Hi, I'm looking to integrate 3xtf32 kernels into TVM soon. I've tried the
I also ran a sweep over all SIMT kernels on the same shape, and the slowest one was around 6400 GFLOPS. Is this expected? @hwu36
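For readers comparing numbers like the one above, here is a minimal sketch of how a GFLOPS figure is usually derived from a GEMM shape and a measured runtime. The function name `gemm_gflops`, the shape, and the runtime are hypothetical and are not taken from this thread.

```python
def gemm_gflops(m, n, k, seconds):
    """Achieved GFLOPS for a single m x n x k GEMM that took `seconds`."""
    # Each of the m * n outputs needs k multiplies and k adds.
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


# Hypothetical shape and runtime, for illustration only.
print(gemm_gflops(4096, 4096, 4096, 0.0215))
```

Profilers may count FLOPs slightly differently (for example, including the epilogue), so treat this as an approximation.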
Replies: 2 comments
-
I don't have the 3070 spec at hand, but I have the 3080 spec, which says FP32 without tensor cores has the same peak performance as TF32 with tensor cores. I also get the same performance as you on a 3070. So I guess there is no point in running 3xTF32 on GeForce cards. 3xTF32 is mainly for A100. Here is what I got on an A100 without locking the frequency:
24 TFLOPS is well beyond the theoretical FP32 peak at the maximum frequency. BTW, you can also use the CUTLASS profiler to profile the CUTLASS 3xTF32 kernels. The CMake command is
You can change
As always, CUTLASS is a co-design between us and the CUDA compiler team. The next CUDA compiler will significantly boost 3xTF32 performance.
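To put the 24 TFLOPS figure in context, the theoretical FP32 peak can be estimated as cores x 2 FLOPs (one fused multiply-add) x clock. The sketch below assumes the publicly listed A100 figures of 6912 FP32 CUDA cores and a 1.41 GHz boost clock; these numbers and the helper name `peak_fp32_tflops` are not from this thread.

```python
def peak_fp32_tflops(cuda_cores, clock_ghz):
    """Theoretical FP32 peak, assuming one FMA (2 FLOPs) per core per cycle."""
    return cuda_cores * 2 * clock_ghz / 1e3


# Assumed A100 figures from the public datasheet (not from this thread).
A100_FP32_CORES = 6912  # 108 SMs x 64 FP32 cores
A100_BOOST_GHZ = 1.41

a100_peak = peak_fp32_tflops(A100_FP32_CORES, A100_BOOST_GHZ)
print(f"A100 FP32 peak: {a100_peak:.2f} TFLOPS")
print("24 TFLOPS exceeds the FP32 peak:", 24.0 > a100_peak)
```

Under these assumptions the FP32 peak is about 19.5 TFLOPS, so a 3xTF32 result of 24 TFLOPS can only come from the tensor cores.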
-
Thanks, that makes sense. I'm surprised to hear that TF32 is not faster than FP32 on GeForce.