3xTF32 GEMM example slower than SIMT? #390

Answered by hwu36
masahi asked this question in Q&A
I don't have the 3070 spec at hand, but the 3080 spec lists FP32 without tensor cores at the same performance as TF32 with tensor cores. I also get the same performance as you on a 3070. So I guess there is no point in running 3xTF32 on GeForce cards.

3xTF32 is mainly intended for the A100. Here is what I got on an A100 without locking the clock frequency:

[haichengw@ipp1-0245 build]$ ./examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm
Runtime: 4.8890 ms
GFLOPs: 23719.49
Normalized L2 norm of
 - 3xTF32 error with FP64 reference : 3.82082010e-07
 - 1xTF32 error with FP64 reference : 2.61451630e-04
 - FP32 error with FP64 reference   : 1.14653750e-06

CSV results
M,N,K,Runtime(ms),GFLO…
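
For readers wondering why the 3xTF32 error above is so much smaller than the 1xTF32 error, here is a minimal CPU-side sketch of the idea, not the CUTLASS kernel itself. It assumes TF32 conversion can be emulated by truncating the 13 low mantissa bits of an FP32 value (real hardware and CUTLASS may round differently), and the names `to_tf32`, `split_tf32`, `dot_1xtf32`, and `dot_3xtf32` are hypothetical helpers made up for illustration. Each FP32 operand is split into a "big" TF32 part and a "small" TF32 residual, and three products are accumulated instead of one; the small-times-small term is dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Emulate FP32 -> TF32 by zeroing the 13 low mantissa bits.
// Assumption: plain truncation; actual hardware rounding may differ.
float to_tf32(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;  // keep sign, 8 exponent bits, 10 mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

// Hypothetical helper: x is approximated by big + small, both TF32 values.
struct Split {
    float big;
    float small;
};

Split split_tf32(float x) {
    float big = to_tf32(x);
    float small = to_tf32(x - big);  // residual that a single TF32 drops
    return {big, small};
}

// FP64 reference dot product.
double dot_fp64(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += static_cast<double>(a[i]) * static_cast<double>(b[i]);
    }
    return sum;
}

// 1xTF32: one product of truncated operands, FP32 accumulation.
float dot_1xtf32(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += to_tf32(a[i]) * to_tf32(b[i]);
    }
    return sum;
}

// 3xTF32: three products per element, FP32 accumulation.
// The small * small term is omitted because it is negligible.
float dot_3xtf32(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        Split sa = split_tf32(a[i]);
        Split sb = split_tf32(b[i]);
        sum += sa.small * sb.big + sa.big * sb.small + sa.big * sb.big;
    }
    return sum;
}
```

Comparing `dot_1xtf32` and `dot_3xtf32` against `dot_fp64` on the same inputs should show the same qualitative gap as the log above: the single-product version loses the low mantissa bits of every operand, while the three-product version recovers most of them at roughly three times the tensor core work. That extra work is why the approach only pays off when TF32 tensor core throughput is well above plain FP32 throughput.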

Answer selected by masahi