gemm fp8 e4m3 #185
base: main
Conversation
Why are there multiple lines for each configuration? Shouldn't it be one line per case?
# cast to FP8
# structure:
# | 1 bit sign | 4 bit exponent | 3 bit mantissa |
a, b = a.to(torch.float8_e4m3fn), b.to(torch.float8_e4m3fn)
Directly casting to FP8 is not a good option, since it will cause a huge performance drop when the input is not FP8 originally. A preferred way is to do dynamic (or static) FP8 quantization, so you can extract a scale factor (in fp32 precision) to conduct a scaled matmul.
Note that ultimately we want to fuse the scaling into the kernel as well, so we can reduce the overhead of quantization and dequantization.
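A minimal sketch of the per-tensor dynamic quantization described above (the helper name is illustrative, not from this PR; 448.0 is the largest finite value of torch.float8_e4m3fn):

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def dynamic_quantize_fp8(x: torch.Tensor):
    # per-tensor dynamic quantization: derive the scale from the current
    # max magnitude and keep it in fp32 for the scaled matmul later
    scale = x.abs().amax().clamp(min=1e-12).float() / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize as x_fp8.to(torch.float32) * scale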
grid = (total_programs_mn, total_programs_k)

c = torch.zeros((m, n), device=a.device, dtype=torch.float16)
gemm_split_k_kernel_forward[grid](
Could you add some comments to explain the reason for using the split-k implementation, e.g., in which scenarios it's preferred?
return LigerFP8GemmSplitKFunction.apply(a_fp8, b_fp8)

def fwd_torch():
    return torch.matmul(a_float, b_float)
Comparing the speed/memory of the fp8 kernel with torch matmul on fp32 is not quite a fair comparison. A better comparison would be against torch._scaled_mm with fp8 matmul, such as the example here: https://gist.github.com/malfet/7874d96b99670c3da83cbb779ab770c6
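A hedged sketch of that comparison path, assuming a PyTorch version (roughly 2.4 or newer) where torch._scaled_mm takes scale_a/scale_b and returns a single tensor, and noting that it requires the second operand in column-major layout; the helper names are illustrative and a compact version of the quantization step above is inlined to keep the snippet self-contained:

import torch

FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

def to_fp8(x):
    # per-tensor dynamic scale, kept in fp32
    scale = x.abs().amax().clamp(min=1e-12).float() / FP8_E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_scaled_matmul(a_bf16, b_bf16):
    a_fp8, scale_a = to_fp8(a_bf16)
    b_fp8, scale_b = to_fp8(b_bf16)
    # torch._scaled_mm expects the second operand to be column-major
    b_fp8 = b_fp8.t().contiguous().t()
    return torch._scaled_mm(a_fp8, b_fp8, scale_a=scale_a, scale_b=scale_b,
                            out_dtype=torch.bfloat16)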
    ]
)
def bench_memory_gemm_split_k_fp8(m, k, n, provider, dtype, device="cuda"):
    a_fp8 = torch.randn((m, k), device=device, dtype=dtype).to(torch.float8_e4m3fn)
Ditto. Let's try to create bf16 inputs and compare the speed/memory of torch._scaled_mm vs. the fp8 kernel, and then compare the joint time of quant + dequant + matmul (with the fp8 scale factor). Thanks!
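A rough sketch of that measurement, assuming the illustrative to_fp8 / fp8_scaled_matmul helpers from the snippet above and that the PR's LigerFP8GemmSplitKFunction takes two float8_e4m3fn tensors; triton.testing.do_bench returns the median runtime in milliseconds:

import torch
import triton

def bench_fp8_paths(m, k, n, device="cuda"):
    # start from bf16 inputs so the quantization cost is part of the measurement
    a = torch.randn((m, k), device=device, dtype=torch.bfloat16)
    b = torch.randn((k, n), device=device, dtype=torch.bfloat16)

    def liger_path():
        # quant + fp8 split-k matmul + dequant with the fp32 scales
        a_fp8, sa = to_fp8(a)
        b_fp8, sb = to_fp8(b)
        return LigerFP8GemmSplitKFunction.apply(a_fp8, b_fp8) * (sa * sb)

    def scaled_mm_path():
        # quant + cuBLAS fp8 matmul baseline
        return fp8_scaled_matmul(a, b)

    ms_liger = triton.testing.do_bench(liger_path)
    ms_torch = triton.testing.do_bench(scaled_mm_path)
    return ms_liger, ms_torch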
Thanks for the effort! I've provided some comments; please take a look and let me know if there are any questions, and we can discuss more here or on Discord.
merge main into matmulfp8
@@ -0,0 +1,302 @@
# adapted from: https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/inference/fp8/splitk_gemm_fp8.py
should we adapt the BSD3 license header from the original repo? @ByronHsu
we can check with legal internally
@qingquansong can you take another look?
Hey @AndreSlavescu @ByronHsu, considering this is a bit different from the original feature request, maybe we can update the PR description to clarify this and also correct the test cases to compare with torch fp8 matmul (or at least torch bf16 matmul rather than fp32) before checking in? Let me know your thoughts, and I'm happy to discuss more.
I tried to run this using 8 x H100 and the test failed because of memory constraints:
FAILED test_gemm.py::test_gemm_split_k[dtype0-0.2-0.2-1024-1024-1024] - triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 327680, Hardware limit: 232448. Reducing block sizes or `num_stages` may help.
FAILED test_gemm.py::test_gemm_split_k[dtype0-0.2-0.2-1024-2048-1024] - triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 327680, Hardware limit: 232448. Reducing block sizes or `num_stages` may help.
The test cases should dynamically run only the configurations that the machine's hardware can handle; a possible pattern is sketched below.
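One hedged option (the helper name is illustrative): catch Triton's OutOfResources error, which the log above shows, and skip the parametrized case instead of failing it.

import pytest
from triton.runtime.errors import OutOfResources

def run_or_skip_on_oor(launch):
    # run a parametrized kernel launch; if the GPU cannot provide the shared
    # memory required by the chosen block sizes / num_stages, skip the case
    try:
        return launch()
    except OutOfResources as exc:
        pytest.skip(f"config exceeds hardware limits: {exc}")

Inside the test this would wrap the kernel invocation, e.g. run_or_skip_on_oor(lambda: gemm_split_k_kernel_forward[grid](...)).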
acc = tl.zeros((block_m, block_n), dtype=tl.float32)

for k_ in range(0, grid_k, step=2):
    k_remaining = k - k_ * (block_k * split_k)

    mask_a = offs_k[None, :] < k_remaining
    mask_b = offs_k[:, None] < k_remaining

    a = tl.load(a_ptrs, mask=mask_a, other=0.0)
    b = tl.load(b_ptrs, mask=mask_b, other=0.0)

    # fp8 input dot product (supported types: [fp8e4nv, fp8e5, fp8e4b15])
    acc = tl.dot(a, b, acc)
You'd have to specify an fp16 out_dtype for tl.dot, as it's fp32 by default.
This will make it work, but make sure it does not impact numerical stability.
Suggested change, from:

acc = tl.zeros((block_m, block_n), dtype=tl.float32)
for k_ in range(0, grid_k, step=2):
    k_remaining = k - k_ * (block_k * split_k)
    mask_a = offs_k[None, :] < k_remaining
    mask_b = offs_k[:, None] < k_remaining
    a = tl.load(a_ptrs, mask=mask_a, other=0.0)
    b = tl.load(b_ptrs, mask=mask_b, other=0.0)
    # fp8 input dot product (supported types: [fp8e4nv, fp8e5, fp8e4b15])
    acc = tl.dot(a, b, acc)

to:

acc = tl.zeros((block_m, block_n), dtype=tl.float16)
for k_ in range(0, grid_k, step=2):
    k_remaining = k - k_ * (block_k * split_k)
    mask_a = offs_k[None, :] < k_remaining
    mask_b = offs_k[:, None] < k_remaining
    a = tl.load(a_ptrs, mask=mask_a, other=0.0)
    b = tl.load(b_ptrs, mask=mask_b, other=0.0)
    # fp8 input dot product (supported types: [fp8e4nv, fp8e5, fp8e4b15])
    acc = tl.dot(a, b, acc, out_dtype=tl.float16)
I tried this previously, and I don't think it's supported. They also don't list it as a param in the documentation, so my guess is that it's designed to be left unmodified.
Summary
Implemented FP8 GEMM with the E4M3 representation (torch.float8_e4m3fn).
Issue #65
Testing Done
Tested square matrices of varying sizes (64, 256, 512, 1024, 2048) as well as non-square matrices of varying sizes, and compared against torch matmul with appropriate casting for the backward pass (torch.matmul doesn't support the fp8_e4m3 dtype for backward).
FP8 GEMM will only work on SM_89+.
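A minimal sketch of that comparison, assuming (as in the snippets quoted earlier) that the PR's LigerFP8GemmSplitKFunction takes two float8_e4m3fn tensors and returns an fp16 result; the 0.2 tolerances mirror the values visible in the failing test ids above:

import torch

def check_against_torch(m, k, n, device="cuda"):
    # LigerFP8GemmSplitKFunction is the autograd Function added in this PR
    a = torch.randn((m, k), device=device, dtype=torch.float16).to(torch.float8_e4m3fn)
    b = torch.randn((k, n), device=device, dtype=torch.float16).to(torch.float8_e4m3fn)

    out_kernel = LigerFP8GemmSplitKFunction.apply(a, b)  # fp16 output
    out_reference = torch.matmul(a.to(torch.float16), b.to(torch.float16))

    # loose tolerances: fp8 inputs lose a lot of mantissa precision
    assert torch.allclose(out_kernel, out_reference, atol=2e-1, rtol=2e-1)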
- run make test to ensure correctness
- run make checkstyle to ensure code style
- run make test-convergence to ensure convergence