MoE FP8 BMM with loopover (pytorch#3147)
Summary:
Pull Request resolved: pytorch#3147

X-link: facebookresearch/FBGEMM#239

Enable MoE FP8 rowwise BMM with loop-over, plus benchmarks:

- MoE FP8 rowwise BMM, implemented as a loop-over with quantization ops, achieves a **1.8x speedup over BF16 BMM with max_autotune** (a minimal sketch of the loop-over approach follows the summary).
- BF16 BMM with torch.compile max_autotune (enabled in D62278399) brings up to a 2x speedup over torch.bmm (cuBLAS).
- Replacing the loop-over Triton FP8 quantization with a 3D kernel brings up to a 30x speedup for the quantization op ([data sheet](https://docs.google.com/spreadsheets/d/1n-nUuus-XXmKykBvvXs3u0AfTJMoChVHQdG_u3pSPrk/edit?usp=sharing)).
- For MoE inference/training with expert parallelism, the number of local experts is normally 2 (4 at most), so the performance of the loop-over MoE FP8 BMM is acceptable.
- Work is ongoing to enable a customized MoE FP8 BMM kernel, which could further improve performance.
- More results are in the [data sheet](https://docs.google.com/spreadsheets/d/1S-XqBh10G8sZqw97AJq37uy-JV6fxlZIVb2C41eeX38/edit?usp=sharing).

{F1873899701}

Reviewed By: jianyuh

Differential Revision: D62889315

fbshipit-source-id: de87d9757f314974af1b12b9b97a47eeca6e2acc
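Below is a minimal PyTorch sketch of the loop-over FP8 rowwise BMM idea described in the summary. It is illustrative only: the real FBGEMM/Triton kernels keep the data in FP8 and use hardware FP8 matmuls, whereas this example dequantizes and uses a plain per-expert matmul so it runs on any PyTorch build (>= 2.1 for `torch.float8_e4m3fn`). The names `fp8_rowwise_quantize` and `loopover_fp8_bmm` are hypothetical helpers, not FBGEMM APIs.

```python
import torch

FP8_MAX = 448.0  # max representable value of float8_e4m3fn

def fp8_rowwise_quantize(x: torch.Tensor):
    """Quantize the last dim of x to FP8 with one scale per row (rowwise scaling)."""
    row_max = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = row_max / FP8_MAX                      # per-row scale
    xq = (x / scale).to(torch.float8_e4m3fn)       # quantized values
    return xq, scale

def loopover_fp8_bmm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Loop over the expert dim E of x[E, M, K] and w[E, N, K]."""
    outs = []
    for e in range(x.shape[0]):                    # E is typically 2-4 local experts
        xq, xs = fp8_rowwise_quantize(x[e])        # [M, K], [M, 1]
        wq, ws = fp8_rowwise_quantize(w[e])        # [N, K], [N, 1]
        # Dequantize for portability; a real kernel would call an FP8 GEMM here.
        y = (xq.to(torch.float32) * xs) @ (wq.to(torch.float32) * ws).t()
        outs.append(y)
    return torch.stack(outs)                       # [E, M, N]

x = torch.randn(2, 128, 256)
w = torch.randn(2, 512, 256)
print(loopover_fp8_bmm(x, w).shape)  # torch.Size([2, 128, 512])
```

For the BF16 baseline referenced in the summary, `torch.compile(torch.bmm, mode="max-autotune")` is the standard way to enable max_autotune on the batched matmul.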