3xTF32 GEMM example slower than SIMT? #390

masahi · 2021-12-22T20:36:05Z

Hi, I'm looking to integrate 3xTF32 kernels into TVM soon. I've tried the 27_ampere_3xtf32_fast_accurate_tensorop_gemm example, and here is the output on an RTX 3070:

$ examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm
Runtime: 16.9899 ms
GFLOPs: 6825.48
Normalized L2 norm of
 - 3xTF32 error with FP64 reference : 3.82082010e-07
 - 1xTF32 error with FP64 reference : 2.61451630e-04
 - FP32 error with FP64 reference   : 1.14653750e-06

CSV results
M,N,K,Runtime(ms),GFLOPS,3xTF32_vs_FP64,1xTF32_vs_FP64,FP32_vs_FP64
3456,4096,4096,1.69898834e+01,6.82548044e+03,3.82082010e-07,2.61451630e-04,1.14653750e-06

GFLOPs: 6825.48 seems low to me. Running the SIMT example in https://github.com/NVIDIA/cutlass#building-one-cuda-core-gemm-kernel gets me this:

$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Passed

       Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8  \
                  --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024  \
                  

           Bytes: 180355072  bytes
           FLOPs: 115992428544  flops
           FLOPs/Byte: 643

         Runtime: 10.326  ms
          Memory: 16.2666 GiB/s

            Math: 11233.1 GFLOP/s

I also ran a sweep over all SIMT kernels on the same shape, and the slowest one was around 6400 GFLOPs.

Is this expected? @hwu36
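For readers comparing the two outputs, here is a small sketch that recomputes the throughput figures from the problem shape and the reported runtimes. The function name `gemm_gflops` is made up for illustration. The `include_epilogue` term is an inference on my part: the profiler's reported `FLOPs: 115992428544` equals 2*M*N*K + 2*M*N for this shape, while the example's GFLOPs figure matches plain 2*M*N*K.

```python
def gemm_gflops(m, n, k, runtime_ms, include_epilogue=False):
    """Throughput of an MxNxK GEMM in GFLOP/s (hypothetical helper)."""
    # One multiply and one add per inner-product term.
    flops = 2 * m * n * k
    if include_epilogue:
        # Inferred from the profiler's FLOPs count: one extra multiply-add
        # per output element.
        flops += 2 * m * n
    return flops / (runtime_ms * 1e-3) / 1e9


M, N, K = 3456, 4096, 4096

# 3xTF32 example on the RTX 3070, runtime taken from the CSV line.
tf32x3_gflops = gemm_gflops(M, N, K, 16.9898834)

# SIMT sgemm from the profiler; its runtime is only printed to 3 decimals.
simt_gflops = gemm_gflops(M, N, K, 10.326, include_epilogue=True)

print(f"3xTF32: {tf32x3_gflops:.2f} GFLOP/s")
print(f"SIMT:   {simt_gflops:.1f} GFLOP/s")
print(f"SIMT / 3xTF32 speed ratio: {simt_gflops / tf32x3_gflops:.2f}x")
```

The first figure reproduces the 6825.48 GFLOPs reported by the example, and the second lands close to the profiler's 11233.1 GFLOP/s (the small gap comes from the rounded runtime), so both tools are measuring the same problem and the gap is real.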

Answered by hwu36

hwu36 (Maintainer) · 2021-12-26T04:59:33Z

I don't have the 3070 spec at hand, but I have the 3080 spec, which says FP32 without tensor cores has the same peak performance as TF32 with tensor cores. I also get the same performance as you on a 3070. So I guess there is no point in running 3xTF32 on GeForce cards.

3xTF32 is mainly for A100. Here is what I got on an A100 without locking the frequency:

[haichengw@ipp1-0245 build]$ ./examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm
Runtime: 4.8890 ms
GFLOPs: 23719.49
Normalized L2 norm of
 - 3xTF32 error with FP64 reference : 3.82082010e-07
 - 1xTF32 error with FP64 reference : 2.61451630e-04
 - FP32 error with FP64 reference   : 1.14653750e-06

CSV results
M,N,K,Runtime(ms),GFLOPS,3xTF32_vs_FP64,1xTF32_vs_FP64,FP32_vs_FP64
3456,4096,4096,4.88897934e+00,2.37194942e+04,3.82082010e-07,2.61451630e-04,1.14653750e-06

24 TFLOPS is well beyond the theoretical peak of FP32 at the max frequency.
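To put numbers on this comparison, here is a rough sketch. The peak figures are not from this thread; they are dense, non-sparse values as I recall them from NVIDIA's public A100 and GA102 documentation and should be treated as assumptions. The divide-by-three bound is also a simplification: it assumes 3xTF32 costs exactly three TF32 tensor-core products per FP32 product and ignores the extra splitting work.

```python
# Assumed peak throughput in TFLOPS (dense), from public NVIDIA specs.
PEAK_TFLOPS = {
    "A100":     {"fp32": 19.5, "tf32_tensor": 156.0},
    "RTX 3080": {"fp32": 29.8, "tf32_tensor": 29.8},
}


def bound_3xtf32(tf32_peak_tflops):
    """Crude upper bound: three tensor-core products per emulated FP32 product."""
    return tf32_peak_tflops / 3.0


for gpu, peak in PEAK_TFLOPS.items():
    bound = bound_3xtf32(peak["tf32_tensor"])
    verdict = "can beat" if bound > peak["fp32"] else "cannot beat"
    print(f"{gpu}: 3xTF32 bound {bound:.1f} TFLOPS vs FP32 peak "
          f"{peak['fp32']:.1f} TFLOPS -> 3xTF32 {verdict} SIMT FP32")

# Measured 3xTF32 throughput on A100 from the output above, in TFLOPS.
measured_a100 = 23719.49 / 1000.0
print(f"A100 measured: {measured_a100:.1f} TFLOPS")
```

Under these assumptions the measured 23.7 TFLOPS on A100 sits above the 19.5 TFLOPS FP32 peak and below the 52 TFLOPS bound, while on a card whose TF32 tensor-core peak equals its FP32 peak the bound falls to about a third of what SIMT FP32 can reach.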

BTW, you can also use the CUTLASS profiler to profile the CUTLASS 3xTF32 kernels. The CMake command is:

cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s1688gemm_1*,cutlass_tensorop_s1688gemm_2*,cutlass_tensorop_s1688gemm_6*

You can change s1688gemm to c1688gemm, s1688fprop, s1688dgrad, or s1688wgrad to profile the 3xTF32 complex GEMM, fprop, dgrad, or wgrad kernels, respectively.
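The substitution described above is mechanical, so a small script can generate the configure line for each kernel family. This is only a convenience sketch: `kernel_filter` and `cmake_command` are hypothetical helper names, and the `1*`, `2*`, `6*` suffixes are copied from the command above rather than derived from the kernel naming scheme.

```python
def kernel_filter(op, suffixes=("1", "2", "6")):
    """Build the CUTLASS_LIBRARY_KERNELS value for one operation prefix."""
    return ",".join(f"cutlass_tensorop_{op}_{s}*" for s in suffixes)


def cmake_command(op, arch="80"):
    """Assemble the configure line in the same form as the command above."""
    return (f"cmake .. -DCUTLASS_NVCC_ARCHS={arch} "
            f"-DCUTLASS_LIBRARY_KERNELS={kernel_filter(op)}")


# Operation prefixes named in the answer.
OPS = ["s1688gemm", "c1688gemm", "s1688fprop", "s1688dgrad", "s1688wgrad"]

for op in OPS:
    print(cmake_command(op))
```

When pasting the printed line into a shell, quoting the `-DCUTLASS_LIBRARY_KERNELS=...` argument avoids accidental glob expansion of the `*` characters.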

As always, CUTLASS is a co-design between us and the CUDA compiler team. The next CUDA compiler release will significantly boost 3xTF32 performance.
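For intuition about the error figures the example reports (3xTF32 far closer to the FP64 reference than 1xTF32), here is a NumPy emulation of the general idea. It is a sketch under stated assumptions, not the CUTLASS implementation: TF32 is modelled by truncating an FP32 mantissa to 10 bits, each input is split into a "big" and a "small" TF32 part, and three of the four cross products are summed. Products are accumulated in FP64 here to isolate the input-rounding effect, whereas the hardware accumulates in FP32, so the absolute numbers will not match the output above.

```python
import numpy as np


def to_tf32(x):
    """Emulate TF32 by zeroing the low 13 mantissa bits of FP32 (truncation)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)


def split_tf32(x):
    """Split FP32 values into big + small parts, each representable in TF32."""
    big = to_tf32(x)
    small = to_tf32(x - big)  # residual lost by the first truncation
    return big, small


def mm64(x, y):
    """Matrix product accumulated in FP64 (assumption made for clarity)."""
    return x.astype(np.float64) @ y.astype(np.float64)


def rel_l2(x, ref):
    """Normalized L2 error against a reference."""
    return np.linalg.norm(x - ref) / np.linalg.norm(ref)


rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, size=(64, 256)).astype(np.float32)
b = rng.uniform(-1.0, 1.0, size=(256, 64)).astype(np.float32)

a_big, a_small = split_tf32(a)
b_big, b_small = split_tf32(b)

reference = mm64(a, b)
# 1xTF32: only the truncated inputs are multiplied.
result_1x = mm64(a_big, b_big)
# 3xTF32: add the two cross terms; the small*small term is dropped.
result_3x = result_1x + mm64(a_big, b_small) + mm64(a_small, b_big)

err_1x = rel_l2(result_1x, reference)
err_3x = rel_l2(result_3x, reference)
print(f"1xTF32 error: {err_1x:.3e}")
print(f"3xTF32 error: {err_3x:.3e}")
```

The dropped small-times-small term is roughly the square of the single-truncation error, which is why three products recover most of the accuracy that one TF32 product loses.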


masahi (Author) · 2021-12-27T07:36:02Z

Thanks, that makes sense. I'm surprised to hear that TF32 is not faster than FP32 on GeForce.
