
FP8 Triton matmul code silently requires contiguous tensors #2713

Open
rationalism opened this issue Jun 11, 2024 · 4 comments

Comments

@rationalism

Hello! Thank you very much for this FP8 rowwise matmul code, it's been extremely helpful. However, there is a subtle bug/hidden requirement when calling this code, e.g. here:

This works great, but only if the second matrix is contiguous in transposed format (e.g. for (M, N, K) = (4096, 2048, 1024), the second matrix must be contiguous with shape (2048, 1024)). If it's not contiguous, the matmul will finish, but the results will be numerically nonsensical.
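A minimal sketch of the workaround, assuming the rowwise FP8 matmul expects its second operand laid out as (N, K) as described above; the shapes are taken from the example, and the actual FBGEMM quantize/matmul call sites are omitted, so no specific API names are assumed:

```python
import torch

M, N, K = 4096, 2048, 1024
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)

# A transpose view has shape (N, K) but is NOT contiguous in that layout.
w = torch.randn(K, N, device="cuda", dtype=torch.bfloat16).t()
assert w.shape == (N, K) and not w.is_contiguous()

# Workaround (before the fix discussed later in this thread): materialize a
# contiguous copy before quantizing / calling the rowwise FP8 matmul;
# otherwise the kernel completes but the results are numerically nonsensical.
w = w.contiguous()
assert w.is_contiguous()
```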

@q10
Contributor

q10 commented Jun 11, 2024

CC @choutim

@sryap
Contributor

sryap commented Jun 20, 2024

Hello @rationalism, thank you for your questions.

These two issues, triton-lang/triton#3952 and pytorch/pytorch#125437, should be related.

@rationalism
Author

@q10 @sryap Tri Dao just released a paper on Flash Attention 3, which also has to deal with contiguous-layout FP8 matmul issues. Might be helpful?

https://tridao.me/publications/flash3/flash3.pdf

jwfromm added a commit to jwfromm/FBGEMM that referenced this issue Jul 31, 2024
Summary:
This diff fixes an issue where our Triton FP8 quantize functions didn't properly handle non-contiguous inputs. Specifically, they write to the output tensor using the same strides as the input, even though the output is always allocated as contiguous. This resulted in the output being unintentionally transposed in some cases.

The result of this issue was that non-contiguous inputs would run fine but produce silently transposed outputs. It was noted on GitHub here: pytorch#2713

Adding explicit output strides to the kernel resolves the issue.

I also found a small issue with D59248142 where scaling wouldn't be applied when the number of elements was smaller than the block size. This caused fp8_gemm_test to fail. I resolved it by extending the check for when to scale.

Reviewed By: jianyuh

Differential Revision: D60535956
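A toy illustration in plain PyTorch (not the actual Triton kernel) of the failure mode described in the commit message above: storing element (i, j) at the input's stride offsets while the output buffer is allocated contiguously leaves the contiguous view holding transposed data.

```python
import torch

# Non-contiguous input: a transpose view with shape (2, 3) and strides (1, 2).
inp = torch.arange(6).reshape(3, 2).t()
out = torch.empty(2, 3, dtype=inp.dtype)  # allocated contiguous, strides (3, 1)

# Emulate the bug: write element (i, j) at the *input's* stride offsets
# into the contiguous output storage.
flat = out.view(-1)
for i in range(inp.shape[0]):
    for j in range(inp.shape[1]):
        flat[i * inp.stride(0) + j * inp.stride(1)] = inp[i, j]

print(inp)  # [[0, 2, 4], [1, 3, 5]]
print(out)  # [[0, 1, 2], [3, 4, 5]] -- the transposed data reinterpreted in
            # the contiguous layout, i.e. a silently transposed result
```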
@jwfromm
Contributor

jwfromm commented Jul 31, 2024

I think this issue should be resolved in #2919. The quantization kernel in Triton was writing its output using the same strides as the input but returning a contiguous tensor, which effectively transposed the output tensor. After the fix, it should always return a contiguous output in the proper layout.
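A minimal Triton sketch of the idea behind the fix, under the assumption that the kernel indexes elements via explicit strides; this is not the FBGEMM kernel, just a copy kernel showing that stores should use the output tensor's own strides rather than reusing the input's.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def copy_rows_kernel(
    x_ptr, y_ptr, N,
    stride_xm, stride_xn,  # input strides (may describe a transposed view)
    stride_ym, stride_yn,  # output strides, passed explicitly (the fix)
    BLOCK: tl.constexpr,
):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < N
    x = tl.load(x_ptr + row * stride_xm + cols * stride_xn, mask=mask)
    # Store with the *output's* strides, not the input's, so the contiguous
    # output buffer keeps the expected row-major layout.
    tl.store(y_ptr + row * stride_ym + cols * stride_yn, x, mask=mask)


x = torch.randn(8, 16, device="cuda").t()  # non-contiguous view, shape (16, 8)
y = torch.empty(16, 8, device="cuda")      # contiguous output buffer

copy_rows_kernel[(x.shape[0],)](
    x, y, x.shape[1],
    x.stride(0), x.stride(1),
    y.stride(0), y.stride(1),
    BLOCK=triton.next_power_of_2(x.shape[1]),
)
assert torch.equal(x, y) and y.is_contiguous()
```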

facebook-github-bot pushed a commit that referenced this issue Aug 1, 2024
Summary:
Pull Request resolved: #2919

Reviewed By: jianyuh

Differential Revision: D60535956

fbshipit-source-id: 0c449e921e2703f2275e24028238f83fec1c0427