Fix CK Profiler Build and Tune Small CK FP8 Shapes (pytorch#3017)

Summary:
Pull Request resolved: pytorch#3017

X-link: facebookresearch/FBGEMM#113

A recent bump to CK broke the profiler build, but excluding the problematic targets resolves the issue. I also snuck in two improvements to the CK shape dispatch, the most significant of which doubles the performance for [64, 1280, 8192], which may be impactful for Llama70B.

Reviewed By: jianyuh

Differential Revision: D61558684

fbshipit-source-id: c4865c8a04ee14bd9fb9e81188cb69f2989b5da0
q10 · Aug 21, 2024 · 1c8ae9d
1 parent 2e190b4
Showing 1 changed file with 2 additions and 2 deletions.
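For readers unfamiliar with the shape dispatch the summary refers to, the following is a minimal sketch of the idea, not the code in fp8_rowwise_gemm.hip. It assumes the table is keyed by an (M, N, K) tuple and that shapes without a tuned entry fall through to some heuristic. The names `Shape`, `ShapeHash`, `select_kernel`, and the `"heuristic_fallback"` string are hypothetical, and kernels are represented by their names as strings rather than by callable kernel instances.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <tuple>
#include <unordered_map>

// Hypothetical key type: a GEMM problem shape (M, N, K).
using Shape = std::tuple<int, int, int>;

// Hypothetical hash for the key; std::tuple has no std::hash specialization,
// so an unordered_map keyed by a tuple needs one supplied explicitly.
struct ShapeHash {
  std::size_t operator()(const Shape& s) const {
    std::size_t h = std::hash<int>{}(std::get<0>(s));
    h = h * 1000003u ^ std::hash<int>{}(std::get<1>(s));
    h = h * 1000003u ^ std::hash<int>{}(std::get<2>(s));
    return h;
  }
};

// Returns the name of the tuned kernel for an exact shape match, or a
// placeholder fallback name when the shape has no tuned entry.
// Kernel names are taken from the post-change side of the diff.
inline std::string select_kernel(int M, int N, int K) {
  static const std::unordered_map<Shape, std::string, ShapeHash> tuned = {
      {Shape{32, 1280, 8192},
       "fp8_rowwise_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2"},
      {Shape{64, 1280, 8192},
       "fp8_rowwise_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2"},
      {Shape{128, 1280, 8192},
       "fp8_rowwise_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v2"},
  };
  const auto it = tuned.find(Shape{M, N, K});
  return it != tuned.end() ? it->second : std::string("heuristic_fallback");
}
```

Under these assumptions, `select_kernel(64, 1280, 8192)` returns the same 128x32x16x128 interwave kernel name as the M = 32 entry, which mirrors what the first hunk below does to the real table.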
diff --git a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip
@@ -51,7 +51,7 @@ static const std::unordered_map<
         {{32, 1280, 8192},
          fp8_rowwise_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2},
         {{64, 1280, 8192},
-         fp8_rowwise_128x64x32x128_32x32_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2},
+         fp8_rowwise_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2},
         {{128, 1280, 8192},
          fp8_rowwise_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v2},
         // Support for decode across batch sizes for [8192, 1024]
@@ -60,7 +60,7 @@ static const std::unordered_map<
         {{32, 8192, 1024},
          fp8_rowwise_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2},
         {{64, 8192, 1024},
-         fp8_rowwise_128x64x32x128_32x32_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2},
+         fp8_rowwise_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2},
         {{128, 8192, 1024},
          fp8_rowwise_256x64x64x128_32x32_1x1_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3},
         // Support for decode across batch sizes for [7168, 8192]