Add Layernorm kernel #641
Conversation
Force-pushed from b772444 to 674d526.
@rahulbatra85, I can't see anything wrong with your PR. I just have some questions and minor code-cleanup suggestions. Feel free to ignore them if you judge that appropriate.
Please drop a short line about this new kernel into the python/perf-kernels/README.md file.
@@ -128,8 +128,10 @@ jobs:
    pytest -vvv ./python/perf-kernels/flash-attention.py
    pytest -vvvv ./python/perf-kernels/softmax.py
    pytest -vvv ./python/perf-kernels/rmsnorm.py
    pytest -vvv ./python/perf-kernels/layernorm.py
What do you think about running all tests with just one pytest invocation? According to https://docs.pytest.org/en/stable/how-to/usage.html, it's possible to do something like pytest -vvvv ./python/perf-kernels. That way, we'd edit .github/workflows/amd_perf_kernel_Integration_tests.yml less often, and new tests would run by default (sketched below). Do you see any drawback?
Maybe it's worth asking @micmelesse's opinion on this.
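To illustrate, the consolidated workflow step might look roughly like this (a sketch only; the hunk position is illustrative, and settling on a single -vvvv verbosity level is an assumption):

@@ jobs:
-    pytest -vvv ./python/perf-kernels/flash-attention.py
-    pytest -vvvv ./python/perf-kernels/softmax.py
-    pytest -vvv ./python/perf-kernels/rmsnorm.py
-    pytest -vvv ./python/perf-kernels/layernorm.py
+    pytest -vvvv ./python/perf-kernels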
Yeah, that's a Michael question.
Let's wait for Michael's opinion!
That is fine. I think some of the tests are broken, but it may be worth it to see the state of things.
    y = x_hat * w + b
    # Write output
    tl.store(Y + cols, y, mask=mask)
Just an idea: we have three for loops that do masked loads. Do you foresee any benefit in peeling the last iteration of each loop, so that all iterations except the last do unmasked loads? I think Shucai and Xiaohu got some performance improvements doing this with GEMMs. I'm not sure whether the idea would be beneficial for layer norm.
ok, yeah, I didn't think of that. Will try this out
Please let me know if this helped at all.
Force-pushed from d88abbc to 13c01c4.
Force-pushed from 13c01c4 to 042aa91.
@triton.autotune(configs=get_autotune_config(), key=['n_rows', 'n_cols'], use_cuda_graph=True)
@triton.jit
def layernorm_kernel(x_ptr, y_ptr, w_ptr, b_ptr, x_row_stride, y_row_stride, n_rows, n_cols, eps,
                     BLOCK_SIZE: tl.constexpr):
Can you add an input use_mask: tl.constexpr to the kernel? Then, in the implementation, the reads from global memory can look like this:
loop_num = tl.cdiv(n_cols, BLOCK_SIZE)
if use_mask:
    loop_num -= 1  # peel the last (partial) block off the main loop

# calculate mean (x_ptr_start is the current row's base pointer, defined earlier in the kernel)
_mean = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
for b in range(0, loop_num):
    col_offsets = b * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x_block = tl.load(x_ptr_start + col_offsets).to(tl.float32)  # unmasked load
    _mean += x_block
if use_mask:
    # peeled epilogue: only the last block needs a masked load
    col_offsets = loop_num * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x_block = tl.load(x_ptr_start + col_offsets, mask=col_offsets < n_cols, other=0.).to(tl.float32)
    _mean += x_block
mean = tl.sum(_mean, axis=0) / n_cols

# the same peeling applies to the variance calculation (shown here in its current, masked form)
_var = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
for b in range(0, n_cols, BLOCK_SIZE):
    col_offsets = b + tl.arange(0, BLOCK_SIZE)
    x_block = tl.load(x_ptr_start + col_offsets, mask=col_offsets < n_cols, other=0.).to(tl.float32)
    x_block = tl.where(col_offsets < n_cols, x_block - mean, 0.)
    _var += x_block * x_block
var = tl.sum(_var, axis=0) / n_cols
rstd = tl.rsqrt(var + eps)
In this way, we don't need a mask in most of the iterations, which can make loading the input more efficient.
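For completeness, here is a hypothetical sketch (not from the PR) of how the launch side could pick the new constexpr. It assumes a grid of one program per row and a fixed BLOCK_SIZE; with @triton.autotune the selected BLOCK_SIZE isn't known at launch time, so in that case one would conservatively pass use_mask=True.

# Hypothetical launch-side sketch: the tail only needs masking when
# n_cols is not a multiple of BLOCK_SIZE.
use_mask = (n_cols % BLOCK_SIZE) != 0
# one program per row (assumed grid); args follow the kernel signature above
layernorm_kernel[(n_rows, )](x, y, w, b, x.stride(0), y.stride(0), n_rows, n_cols, eps,
                             BLOCK_SIZE=BLOCK_SIZE, use_mask=use_mask)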
ok, will try this out.
I did this and the perf improves: https://github.com/ROCm/triton-internal/issues/126#issuecomment-2369175077
Force-pushed from e389075 to ccb3538.