Final Project PR for StreamK #1318
Open. abbyoutsider wants to merge 3 commits into ROCm:ck_streamk_2tile_sk_dp from abbyoutsider:ck_streamk_2tile_sk_dp (+44 −2)
DP+Stream-K: Optimized GEMM for AMD GPUs
This document describes optimizations to DP+Stream-K, an algorithm designed to accelerate General Matrix Multiply (GEMM) on AMD GPUs. It combines Data Parallel (DP) scheduling, which handles uniform workloads efficiently, with Stream-K, which is good at managing irregular workloads. The combination balances workload distribution and speeds up GEMM processing.

Prior Work Leveraged:

Data Parallel (DP): Achieves high throughput by distributing the workload across processing units for concurrent execution.
Stream-K: Handles irregular problem shapes by decomposing GEMM operations into smaller tiles that fit in shared memory and spreading the resulting work evenly across processing units.
DP+Stream-K Benefits:

DP+Stream-K Algorithm Optimization:

Pre-Stream-K Conditional Check
Idea: Instead of applying Stream-K unconditionally, we propose a check that decides whether it is beneficial based on the workload size.
Reasoning: When the workload already fills most of a full dispatch (more than roughly 90%), plain DP leaves few processing units idle, so Stream-K's extra overhead may outweigh its load-balancing benefit compared to using all DP blocks.

Implementation: We introduce a scalar threshold (e.g., 0.9). If the workload size, measured as a fraction of a full dispatch, exceeds this threshold, we disable Stream-K and use all DP blocks for better efficiency.
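
The check described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function name use_stream_k, its parameters, and the reading of "workload size" as the number of tiles left over in the final partial wave are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the Composable Kernel API.
// num_tiles: number of output tiles in the GEMM.
// num_cus:   number of workgroups in one full dispatch.
// threshold: fraction of a full dispatch above which Stream-K is skipped.
inline bool use_stream_k(std::uint32_t num_tiles,
                         std::uint32_t num_cus,
                         double threshold = 0.9)
{
    assert(num_cus > 0);

    // Tiles that do not fit into full waves (assumed meaning of "workload").
    const std::uint32_t leftover = num_tiles % num_cus;
    if(leftover == 0)
        return false; // tiles divide evenly, so DP already fills every wave

    // How full the last, partial wave would be under pure DP.
    const double fill = static_cast<double>(leftover) / num_cus;

    // A nearly full last wave wastes little, so stay with DP in that case.
    return fill <= threshold;
}
```

With 120 workgroups, 230 tiles leave a last wave that is about 92% full, so the sketch falls back to DP; 150 tiles leave it 25% full, so Stream-K stays enabled.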
Even Workload Distribution Among Stream-K Blocks

Issue: Stream-K uses large blocks to handle workload unevenness. However, these "big blocks" can lead to idle time in other blocks, reducing resource utilization.

Solution: We aim to eliminate big blocks by adjusting the "K per block" value in Stream-K. Ideally, this value should be divisible by the number of Stream-K blocks to avoid remainders and big blocks.
Challenge: While theoretical benefits were proven, code modifications to adjust K per block resulted in build errors. Further investigation is needed.

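To make the remainder argument concrete, here is a small sketch of how big blocks can arise and how a K-per-block value could be chosen to avoid them. It assumes that the total number of Stream-K iterations is split as evenly as possible across the Stream-K blocks and that any remainder becomes "big blocks" with one extra iteration each. The names SkSplit, split_sk_iters and pick_k_per_block are hypothetical and do not come from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical result type: how Stream-K iterations are spread over blocks.
struct SkSplit
{
    std::uint32_t iters_per_block; // iterations given to a regular block
    std::uint32_t num_big_blocks;  // blocks that get one extra iteration
};

// Split the Stream-K iteration space evenly; the remainder yields big blocks.
inline SkSplit split_sk_iters(std::uint32_t sk_tiles,
                              std::uint32_t k,
                              std::uint32_t k_per_block,
                              std::uint32_t sk_blocks)
{
    assert(k_per_block > 0 && sk_blocks > 0);
    const std::uint32_t iters_per_tile = (k + k_per_block - 1) / k_per_block;
    const std::uint32_t total_iters    = sk_tiles * iters_per_tile;
    return {total_iters / sk_blocks, total_iters % sk_blocks};
}

// Return the first candidate K-per-block that leaves no big blocks,
// or 0 if none of the candidates divides the work evenly.
inline std::uint32_t pick_k_per_block(std::uint32_t sk_tiles,
                                      std::uint32_t k,
                                      std::uint32_t sk_blocks,
                                      const std::vector<std::uint32_t>& candidates)
{
    for(std::uint32_t kpb : candidates)
    {
        const SkSplit s = split_sk_iters(sk_tiles, k, kpb, sk_blocks);
        if(s.iters_per_block > 0 && s.num_big_blocks == 0)
            return kpb;
    }
    return 0;
}
```

For example, with 3 Stream-K tiles, K = 1024 and 32 Stream-K blocks, a K per block of 64 gives 48 iterations and therefore 16 big blocks, while a K per block of 32 gives 96 iterations, which divide evenly into 3 per block.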
Results:

Empirical evaluations confirm significant performance improvements in computational efficiency, throughput, and overall GEMM runtime with DP+Stream-K.
This optimization is particularly impactful for large and irregular matrices, leading to faster computations in HPC and deep learning applications.

Future Work

The scalar value introduced in the pre-Stream-K check provides a quantitative measure of resource utilization. We can leverage this to further refine the DP+Stream-K algorithm for optimal performance across different workloads and GPU architectures.

Enhanced Performance: Achieves significant performance improvements over traditional Stream-K, particularly for large or irregular matrices.
Balanced Workload: Mitigates workload imbalance by combining DP's initial uniform processing with Stream-K's handling of leftovers.
Faster Execution: Minimizes large blocks and ensures consistent execution times through clever data division and improved memory access patterns.
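
The "DP first, Stream-K for leftovers" split mentioned under Balanced Workload can be illustrated with a short sketch. It assumes the simplest possible policy, where every full wave of tiles goes to DP and only the remaining tiles go to Stream-K. The actual policy in this branch may differ, and the names TileSplit and split_tiles are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: tiles handled by each scheduling scheme.
struct TileSplit
{
    std::uint32_t dp_tiles; // tiles processed in full, uniform DP waves
    std::uint32_t sk_tiles; // leftover tiles handed to Stream-K
};

// Assign all full waves to DP and the final partial wave to Stream-K.
inline TileSplit split_tiles(std::uint32_t num_tiles, std::uint32_t num_cus)
{
    assert(num_cus > 0);
    const std::uint32_t full_waves = num_tiles / num_cus;
    return {full_waves * num_cus, num_tiles % num_cus};
}
```

With 250 tiles and 120 workgroups, this sketch assigns 240 tiles to DP (two full waves) and the remaining 10 tiles to Stream-K.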
Please do not change this to /opt/rocm-xxx, since this script will be used by other developers with different ROCm versions. If you need a specific ROCm version, or your environment does not have /opt/rocm/ but only /opt/rocm-5.7.1, you can create a soft link to /opt/rocm/, e.g. ln -s /opt/rocm-5.7.1 /opt/rocm