[HGEMM] Release toy-hgemm library 0.1.0 (#145)
* Refactor HGEMM codes

* Refactor HGEMM codes

* Refactor HGEMM codes

* Create utils.py

* Update hgemm.py

* Update setup.py

* Update hgemm.cc

* Update utils.py

* Update setup.py

* Create clear.sh

* Update setup.py

* Update utils.py

* Update hgemm.py

* Update utils.py

* Delete hgemm/utils.py

* Create utils.py

* Update utils.py

* Create clear.sh

* Create install.sh

* Delete hgemm/clear.sh

* Update hgemm.py

* Update utils.py

* Update setup.py

* Update README.md

* Update utils.py

* Update setup.py

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md
DefTruth authored Nov 22, 2024
1 parent 60d4ad2 commit 6ea2eb9
Showing 24 changed files with 459 additions and 287 deletions.
43 changes: 22 additions & 21 deletions README.md
@@ -16,7 +16,7 @@

<div id="contents"></div>

📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥](#cuda-kernel) with PyTorch bindings, [📖30+ LLM/VLM🔥](#my-blogs-part-1), [📖40+ CV/C++...🔥](#my-blogs-part-2), [📖50+ CUDA/CuTe...🔥](#other-blogs) Blogs and [📖HGEMM/SGEMM🔥🔥](#hgemm-sgemm) which has been fully optimized, check [📖HGEMM/SGEMM Supported Matrix👇](#hgemm-sgemm) for more details. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥](#cuda-kernel) with PyTorch bindings, [📖30+ LLM/VLM🔥](#my-blogs-part-1), [📖40+ CV/C++...🔥](#my-blogs-part-2), [📖50+ CUDA/CuTe...🔥](#other-blogs) Blogs and [📖toy-hgemm library🔥🔥](./hgemm) which can achieve the performance of **cuBLAS**, check [📖HGEMM Supported Matrix👇](#hgemm-sgemm) for more details. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉

<div id="hgemm-sgemm"></div>

@@ -25,7 +25,7 @@
<img src='https://github.com/user-attachments/assets/05ef4f5e-d999-48ea-b58e-782cffb24e85' height="225px" width="403px">
</div>

Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [hgemm benchmark](./hgemm) for more details.
Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./hgemm) for more details.
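
For context, a minimal sketch of that cuBLAS baseline call (illustrative, not this repo's code; error checks omitted, dA/dB/dC are device pointers, column-major layout assumed):

```cuda
// Minimal sketch of the cuBLAS FP16 baseline (illustrative, not this repo's
// code; error checks omitted, dA/dB/dC are device pointers, column-major).
#include <cublas_v2.h>
#include <cuda_fp16.h>

void hgemm_cublas_ref(const half* dA, const half* dB, half* dC,
                      int M, int N, int K) {
  cublasHandle_t handle;
  cublasCreate(&handle);
  const half alpha = __float2half(1.0f), beta = __float2half(0.0f);
  // CUBLAS_GEMM_DEFAULT_TENSOR_OP is the default Tensor Cores algorithm
  // that the kernels in this repo are benchmarked against.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
               &alpha, dA, CUDA_R_16F, M, dB, CUDA_R_16F, K,
               &beta,  dC, CUDA_R_16F, M,
               CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
  cublasDestroy(handle);
}
```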

|CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
@@ -202,26 +202,27 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d
| ✔️ [sgemm_t_8x8_sliced_k16...async](./sgemm/sgemm_async.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
| ✔️ [sgemm_wmma_m16n16k8...stages*](./sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
| ✔️ [sgemm_wmma_m16n16k8...swizzle*](./sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_naive_f16](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
| ✔️ [hgemm_sliced_k_f16](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_naive_f16](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
| ✔️ [hgemm_sliced_k_f16](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8x8_sliced_k_f16x4](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8x8_sliced_k...dbuf](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8/16x8...k16/32...dbuf](./hgemm/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8/16x8...k16/32...async](./hgemm/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...naive*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...mma4x4*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...stages*](./hgemm/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...swizzle*](./hgemm/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...naive*](./hgemm/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./hgemm/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...stages*](./hgemm/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...swizzle*](./hgemm/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_stages{swizzle}...cute*](./hgemm/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8x8_sliced_k...dbuf](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8/16x8...k16/32...dbuf](./hgemm/naive/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_t_8/16x8...k16/32...async](./hgemm/naive/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...naive*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...mma4x4*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...stages*](./hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...swizzle*](./hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...naive*](./hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...stages*](./hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...swizzle*](./hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_stages{swizzle}...cute*](./hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
| ✔️ [hgemm_mma_cublas*](./hgemm/cublas/hgemm_cublas.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
| ✔️ [sgemv_k32_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
| ✔️ [sgemv_k128_f32x4](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
| ✔️ [sgemv_k16_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
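
To make the kernel list above concrete, here is a minimal sketch in the spirit of hgemm_naive_f16 (illustrative only, not the repo's code): one thread computes one element of C, looping over K.

```cuda
// Naive f16 GEMM sketch: one thread per C element, row-major matrices.
// Launch with e.g. dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16).
#include <cuda_fp16.h>

__global__ void hgemm_naive_f16_sketch(const half* A, const half* B, half* C,
                                       int M, int N, int K) {
  int n = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
  int m = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
  if (m >= M || n >= N) return;
  float acc = 0.0f;  // accumulate in f32 for accuracy
  for (int k = 0; k < K; ++k)
    acc += __half2float(A[m * K + k]) * __half2float(B[k * N + n]);
  C[m * N + n] = __float2half(acc);
}
```

The optimized variants in the table replace this per-element loop with shared-memory tiling, double buffering, async copies, and Tensor Cores (WMMA/MMA).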
6 changes: 6 additions & 0 deletions hgemm/.gitignore
@@ -18,3 +18,9 @@ __pycache__
*.engine
*.bin
*.out
*bin
bin
output
*.egg-info
*.whl
dist
15 changes: 10 additions & 5 deletions hgemm/README.md
@@ -1,6 +1,4 @@
# HGEMM

## HGEMM/SGEMM Supported Matrix
# 🔥🔥Toy-HGEMM Library: Achieve the performance of cuBLAS

|CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
@@ -45,6 +43,13 @@

</details>

## Installation
The HGEMM CUDA kernels implemented in this repo can be used as a Python library, toy-hgemm; install it with the commands below. (Optional)
```bash
git submodule update --init --recursive --force
bash tools/install.sh # to uninstall: pip uninstall toy-hgemm
```

## Test Commands

**CUTLASS**: update the CUTLASS dependency
@@ -154,7 +159,7 @@ python3 hgemm.py --cute-tn --mma --wmma-all --plot

Tested on an NVIDIA GeForce RTX 3080 Laptop under Windows WSL2: with mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) plus thread block swizzle, most cases match or even exceed cuBLAS.

![](./NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
![](./bench/NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
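
A minimal sketch of the thread block swizzle mentioned above (GROUP_M and all names are assumptions, not this repo's exact scheme): the linear block id is remapped so that a group of concurrent blocks covers a compact region of C, improving L2 reuse of A and B tiles.

```cuda
// Grouped block-id remapping, a common form of thread block swizzle.
// Blocks walk down a group of GROUP_M tile-rows before moving across
// columns, so blocks running at the same time share A/B tiles in L2.
__device__ void swizzle_tile(int bid, int grid_m, int grid_n,
                             int* tile_m, int* tile_n) {
  const int GROUP_M = 8;                            // tiles per group (tunable)
  int group_size = GROUP_M * grid_n;                // blocks in one group
  int first_m    = (bid / group_size) * GROUP_M;    // first tile-row of group
  int rows       = min(GROUP_M, grid_m - first_m);  // last group may be short
  *tile_m = first_m + (bid % group_size) % rows;    // down the group first,
  *tile_n = (bid % group_size) / rows;              // then across columns
}
```

Each block then computes the C tile at (tile_m, tile_n) instead of its natural (blockIdx.y, blockIdx.x) position.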

```bash
python3 hgemm.py --wmma-all --plot
@@ -175,7 +180,7 @@ sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize96x64x32_stage3_warpsize2x2x1_tens
```
Therefore, only an HGEMM implementation that uses Tensor Cores can hope to approach PyTorch/cuBLAS performance.
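As a concrete illustration of HGEMM on Tensor Cores, here is a minimal single-warp WMMA m16n16k16 sketch (illustrative, not one of the repo's tuned kernels; assumes row-major matrices with M, N, K multiples of 16 and a launch of one 32-thread warp per block, grid(N/16, M/16)):

```cuda
// Minimal WMMA m16n16k16 sketch: one warp computes one 16x16 tile of C
// on Tensor Cores. Assumes row-major A/B/C and M, N, K multiples of 16.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void hgemm_wmma_sketch(const half* A, const half* B, half* C,
                                  int M, int N, int K) {
  int tile_m = blockIdx.y * 16, tile_n = blockIdx.x * 16;  // C tile origin
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
  wmma::fill_fragment(c_frag, __float2half(0.0f));
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);  // A tile (m, k)
    wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);  // B tile (k, n)
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);         // Tensor Core MMA
  }
  wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N,
                          wmma::mem_row_major);
}
```

The repo's stage/swizzle variants build on this core loop with shared-memory staging, multi-stage async pipelines, and block swizzling.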
```bash
ncu -o hgemm.prof -f python3 prof.py
ncu -o hgemm.prof -f python3 bench/prof.py
nsys profile --stats=true -t cuda,osrt,nvtx -o hgemm.prof --force-overwrite true python3 prof.py
```
- SASS (L20)
File renamed without changes
File renamed without changes
File renamed without changes.
File renamed without changes.
File renamed without changes.