From 277d06485f990523363d3910f711dcb79ac6b14b Mon Sep 17 00:00:00 2001
From: Coding Monster
Date: Fri, 31 May 2024 21:48:39 -0700
Subject: [PATCH 1/3] Add Scalar in SK logic

---
 .../ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp b/include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp
index 382bd0971c..83cb8901c4 100755
--- a/include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp
+++ b/include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp
@@ -1040,7 +1040,14 @@ struct BlockToCTileMap_GemmStreamK
             }
             else
                 sk_num_blocks = sk_blocks;
-
+
+        // Check whether Stream-K is worth doing.
+        // If the partial dispatch is only a few blocks short of a full DP dispatch, DP GEMM could be more efficient.
+        // The scalar is set to 0.9; a stricter value closer to 1.0 is also conceivable.
+        const double scalar = 0.9;
+        if (num_tiles % one_wave >= scalar * one_wave || num_tiles % one_wave == 0){
+            sk_num_blocks = 0;
+        }
         // default to regular DP GEMM if sk blocks == 0
         if(sk_num_blocks == 0 || sk_num_blocks == 0xFFFFFFFF)
         {

From 2068ea1a426b7e8a915144afa842906d03392f3b Mon Sep 17 00:00:00 2001
From: Coding Monster
Date: Fri, 31 May 2024 21:49:31 -0700
Subject: [PATCH 2/3] Create README.md

---
 .../ck/tensor_operation/gpu/grid/README.md | 35 +++++++++++++++++++
 1 file changed, 35 insertions(+)
 create mode 100644 include/ck/tensor_operation/gpu/grid/README.md

diff --git a/include/ck/tensor_operation/gpu/grid/README.md b/include/ck/tensor_operation/gpu/grid/README.md
new file mode 100644
index 0000000000..8b33e87496
--- /dev/null
+++ b/include/ck/tensor_operation/gpu/grid/README.md
@@ -0,0 +1,35 @@
+DP+Stream-K: Optimized GEMM for AMD GPUs
+This document describes optimizations to DP+Stream-K, an algorithm designed to accelerate General Matrix Multiply (GEMM) on AMD GPUs.
It achieves this by combining Data Parallel (DP) decomposition, which handles uniform workloads efficiently, with Stream-K, which handles the leftover tiles that do not fill a full dispatch. The combination balances the workload across blocks and shortens GEMM runtime.
+
+Prior Work Leveraged:
+
+Data Parallel (DP): Assigns each output tile to one block so that tiles are computed concurrently across processing units.
+Stream-K: Balances irregular workloads by splitting the K-loop iterations of the output tiles evenly across a fixed number of blocks.
+DP+Stream-K Benefits:
+
+DP+Stream-K Algorithm Optimization:
+
+Pre-Stream-K Conditional Check
+Idea: Instead of always applying Stream-K, first check whether it is likely to help, based on how many tiles remain after the full DP dispatches.
+Reasoning: When the leftover tiles fill at least 90% of a full dispatch (or fill it exactly), a plain DP dispatch leaves few blocks idle, so Stream-K's overhead may outweigh its benefit.
+
+Implementation: A scalar (currently 0.9) defines the threshold. If num_tiles % one_wave is at least scalar * one_wave, or is zero, sk_num_blocks is set to 0, which disables Stream-K and falls back to regular DP GEMM.
+Even Workload Distribution Among Stream-K Blocks
+
+Issue: When the Stream-K work does not divide evenly, some blocks ("big blocks") receive extra work. The other blocks then sit idle while the big blocks finish, which reduces resource utilization.
+
+Solution: We aim to eliminate big blocks by adjusting the "K per block" value in Stream-K. Ideally, this value should be divisible by the number of Stream-K blocks so that no remainder, and therefore no big block, is left over.
+Challenge: The benefit is expected in theory, but the code changes that adjust K per block produced build errors. Further investigation is needed.
+
+Results:
+
+Empirical evaluations confirm significant performance improvements in computational efficiency, throughput, and overall GEMM runtime with DP+Stream-K.
+This optimization is particularly impactful for large and irregular matrices, leading to faster computations in HPC and deep learning applications.
+
+Future Work
+
+The scalar in the pre-Stream-K check acts as a tunable threshold on dispatch utilization. Tuning it per workload and GPU architecture is a natural next step for refining DP+Stream-K.
+
+Enhanced Performance: Achieves significant performance improvements over traditional Stream-K, particularly for large or irregular matrices.
+Balanced Workload: Mitigates workload imbalance by combining DP's initial uniform processing with Stream-K's handling of leftovers.
+Faster Execution: Minimizes big blocks so that blocks finish at similar times, through even work division and improved memory access patterns.

From adb32b932f62caa202932fbccf3ad92d11277bbe Mon Sep 17 00:00:00 2001
From: Coding Monster
Date: Fri, 31 May 2024 21:51:02 -0700
Subject: [PATCH 3/3] Update cmake-ck-dev.sh

---
 script/cmake-ck-dev.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/script/cmake-ck-dev.sh b/script/cmake-ck-dev.sh
index 51d6f7a30c..3f516e3734 100755
--- a/script/cmake-ck-dev.sh
+++ b/script/cmake-ck-dev.sh
@@ -7,7 +7,7 @@ MY_PROJECT_SOURCE=$1
 
 cmake \
 -D CMAKE_PREFIX_PATH=/opt/rocm \
--D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
+-D CMAKE_CXX_COMPILER=/opt/rocm-5.7.1/bin/hipcc \
 -D CMAKE_CXX_FLAGS="-std=c++17 -O3 -ftemplate-backtrace-limit=0 -fPIE -Wno-gnu-line-marker" \
 -D CMAKE_BUILD_TYPE=Release \
 -D BUILD_DEV=ON \
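
Note on PATCH 1/3: the pre-Stream-K check can be exercised outside the kernel headers. The sketch below restates the condition from the hunk as a free function so the threshold behaviour is easy to inspect. The name falls_back_to_dp and its signature are hypothetical and do not exist in block_to_ctile_map.hpp; num_tiles, one_wave, and the 0.9 scalar are taken from the patch, and one_wave is assumed to be the number of tiles in one full dispatch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of Composable Kernel.
// Returns true when the condition from PATCH 1/3 would set sk_num_blocks to 0,
// i.e. when the GEMM would fall back to regular DP.
bool falls_back_to_dp(std::uint32_t num_tiles, std::uint32_t one_wave, double scalar = 0.9)
{
    assert(one_wave > 0 && "one_wave must be non-zero");

    // Tiles left over after all full dispatches have been issued.
    const std::uint32_t remainder = num_tiles % one_wave;

    // Same test as in the patch: an exact multiple of one_wave, or a final
    // partial dispatch that is at least `scalar` of a full one.
    return remainder == 0 || remainder >= scalar * one_wave;
}
```

With an assumed one_wave of 120, falls_back_to_dp(240, 120) and falls_back_to_dp(230, 120) both return true (remainders 0 and 110), while falls_back_to_dp(300, 120) returns false (remainder 60), so only the last case would keep Stream-K enabled.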